Background

Open-Meteo maintains an API for historical weather that allows for non-commercial usage of historical weather data maintained by the website.

This file builds on _v001, _v002, _v003, _v004, and _v005 to run exploratory analysis on some historical weather data.

Functions and Libraries

The exploration process uses tidyverse, ranger, several generic custom functions, and several functions specific to Open Meteo processing. First, tidyverse, ranger, and the generic functions are loaded:

library(tidyverse) # tidyverse functionality is included throughout
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ranger) # predict() does not work on ranger objects unless ranger has been called

source("./Generic_Added_Utility_Functions_202105_v001.R") # Basic functions

Next, specific functions written in _v001, _v002, _v003, _v004, and _v005 are sourced:

source("./SimpleOneVar_Functions_202411_v001.R") # Functions for basic single variable analysis
source("./OpenMeteo_NextBest_202411_v001.R") # Functions for finding 'next best' predictor given existing model
source("./OpenMeteo_Functions_202411_v001.R") # Core functions for loading, processing, analysis of Open Meteo
source("./Generic_Analysis_Functions_202411_v001.R") # Additional functions for random forest and related analysis

Key mapping tables for available metrics are also copied:

hourlyMetrics <- "temperature_2m,relativehumidity_2m,dewpoint_2m,apparent_temperature,pressure_msl,surface_pressure,precipitation,rain,snowfall,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,shortwave_radiation,direct_radiation,direct_normal_irradiance,diffuse_radiation,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,windgusts_10m,et0_fao_evapotranspiration,weathercode,vapor_pressure_deficit,soil_temperature_0_to_7cm,soil_temperature_7_to_28cm,soil_temperature_28_to_100cm,soil_temperature_100_to_255cm,soil_moisture_0_to_7cm,soil_moisture_7_to_28cm,soil_moisture_28_to_100cm,soil_moisture_100_to_255cm"
dailyMetrics <- "weathercode,temperature_2m_max,temperature_2m_min,apparent_temperature_max,apparent_temperature_min,precipitation_sum,rain_sum,snowfall_sum,precipitation_hours,sunrise,sunset,windspeed_10m_max,windgusts_10m_max,winddirection_10m_dominant,shortwave_radiation_sum,et0_fao_evapotranspiration"

hourlyDescription <- "Air temperature at 2 meters above ground\nRelative humidity at 2 meters above ground\nDew point temperature at 2 meters above ground\nApparent temperature is the perceived feels-like temperature combining wind chill factor, relative humidity and solar radiation\nAtmospheric air pressure reduced to mean sea level (msl) or pressure at surface. Typically pressure on mean sea level is used in meteorology. Surface pressure gets lower with increasing elevation.\nAtmospheric air pressure reduced to mean sea level (msl) or pressure at surface. Typically pressure on mean sea level is used in meteorology. Surface pressure gets lower with increasing elevation.\nTotal precipitation (rain, showers, snow) sum of the preceding hour. Data is stored with a 0.1 mm precision. If precipitation data is summed up to monthly sums, there might be small inconsistencies with the total precipitation amount.\nOnly liquid precipitation of the preceding hour including local showers and rain from large scale systems.\nSnowfall amount of the preceding hour in centimeters. For the water equivalent in millimeter, divide by 7. E.g. 7 cm snow = 10 mm precipitation water equivalent\nTotal cloud cover as an area fraction\nLow level clouds and fog up to 2 km altitude\nMid level clouds from 2 to 6 km altitude\nHigh level clouds from 6 km altitude\nShortwave solar radiation as average of the preceding hour. This is equal to the total global horizontal irradiation\nDirect solar radiation as average of the preceding hour on the horizontal plane and the normal plane (perpendicular to the sun)\nDirect solar radiation as average of the preceding hour on the horizontal plane and the normal plane (perpendicular to the sun)\nDiffuse solar radiation as average of the preceding hour\nWind speed at 10 or 100 meters above ground. Wind speed on 10 meters is the standard level.\nWind speed at 10 or 100 meters above ground. Wind speed on 10 meters is the standard level.\nWind direction at 10 or 100 meters above ground\nWind direction at 10 or 100 meters above ground\nGusts at 10 meters above ground of the indicated hour. Wind gusts in CERRA are defined as the maximum wind gusts of the preceding hour. Please consult the ECMWF IFS documentation for more information on how wind gusts are parameterized in weather models.\nET0 Reference Evapotranspiration of a well watered grass field. Based on FAO-56 Penman-Monteith equations ET0 is calculated from temperature, wind speed, humidity and solar radiation. Unlimited soil water is assumed. ET0 is commonly used to estimate the required irrigation for plants.\nWeather condition as a numeric code. Follow WMO weather interpretation codes. See table below for details. Weather code is calculated from cloud cover analysis, precipitation and snowfall. As barely no information about atmospheric stability is available, estimation about thunderstorms is not possible.\nVapor Pressure Deificit (VPD) in kilopascal (kPa). For high VPD (>1.6), water transpiration of plants increases. For low VPD (<0.4), transpiration decreases\nAverage temperature of different soil levels below ground.\nAverage temperature of different soil levels below ground.\nAverage temperature of different soil levels below ground.\nAverage temperature of different soil levels below ground.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths."
dailyDescription <- "The most severe weather condition on a given day\nMaximum and minimum daily air temperature at 2 meters above ground\nMaximum and minimum daily air temperature at 2 meters above ground\nMaximum and minimum daily apparent temperature\nMaximum and minimum daily apparent temperature\nSum of daily precipitation (including rain, showers and snowfall)\nSum of daily rain\nSum of daily snowfall\nThe number of hours with rain\nSun rise and set times\nSun rise and set times\nMaximum wind speed and gusts on a day\nMaximum wind speed and gusts on a day\nDominant wind direction\nThe sum of solar radiaion on a given day in Megajoules\nDaily sum of ET0 Reference Evapotranspiration of a well watered grass field"

# Create tibble for hourly metrics
tblMetricsHourly <- tibble::tibble(metric=hourlyMetrics %>% str_split_1(","), 
                                   description=hourlyDescription %>% str_split_1("\n")
                                   )
tblMetricsHourly %>% 
    print(n=50)
## # A tibble: 33 × 2
##    metric                        description                                    
##    <chr>                         <chr>                                          
##  1 temperature_2m                Air temperature at 2 meters above ground       
##  2 relativehumidity_2m           Relative humidity at 2 meters above ground     
##  3 dewpoint_2m                   Dew point temperature at 2 meters above ground 
##  4 apparent_temperature          Apparent temperature is the perceived feels-li…
##  5 pressure_msl                  Atmospheric air pressure reduced to mean sea l…
##  6 surface_pressure              Atmospheric air pressure reduced to mean sea l…
##  7 precipitation                 Total precipitation (rain, showers, snow) sum …
##  8 rain                          Only liquid precipitation of the preceding hou…
##  9 snowfall                      Snowfall amount of the preceding hour in centi…
## 10 cloudcover                    Total cloud cover as an area fraction          
## 11 cloudcover_low                Low level clouds and fog up to 2 km altitude   
## 12 cloudcover_mid                Mid level clouds from 2 to 6 km altitude       
## 13 cloudcover_high               High level clouds from 6 km altitude           
## 14 shortwave_radiation           Shortwave solar radiation as average of the pr…
## 15 direct_radiation              Direct solar radiation as average of the prece…
## 16 direct_normal_irradiance      Direct solar radiation as average of the prece…
## 17 diffuse_radiation             Diffuse solar radiation as average of the prec…
## 18 windspeed_10m                 Wind speed at 10 or 100 meters above ground. W…
## 19 windspeed_100m                Wind speed at 10 or 100 meters above ground. W…
## 20 winddirection_10m             Wind direction at 10 or 100 meters above ground
## 21 winddirection_100m            Wind direction at 10 or 100 meters above ground
## 22 windgusts_10m                 Gusts at 10 meters above ground of the indicat…
## 23 et0_fao_evapotranspiration    ET0 Reference Evapotranspiration of a well wat…
## 24 weathercode                   Weather condition as a numeric code. Follow WM…
## 25 vapor_pressure_deficit        Vapor Pressure Deificit (VPD) in kilopascal (k…
## 26 soil_temperature_0_to_7cm     Average temperature of different soil levels b…
## 27 soil_temperature_7_to_28cm    Average temperature of different soil levels b…
## 28 soil_temperature_28_to_100cm  Average temperature of different soil levels b…
## 29 soil_temperature_100_to_255cm Average temperature of different soil levels b…
## 30 soil_moisture_0_to_7cm        Average soil water content as volumetric mixin…
## 31 soil_moisture_7_to_28cm       Average soil water content as volumetric mixin…
## 32 soil_moisture_28_to_100cm     Average soil water content as volumetric mixin…
## 33 soil_moisture_100_to_255cm    Average soil water content as volumetric mixin…
# Create tibble for daily metrics
tblMetricsDaily <- tibble::tibble(metric=dailyMetrics %>% str_split_1(","), 
                                  description=dailyDescription %>% str_split_1("\n")
                                   )
tblMetricsDaily
## # A tibble: 16 × 2
##    metric                     description                                       
##    <chr>                      <chr>                                             
##  1 weathercode                The most severe weather condition on a given day  
##  2 temperature_2m_max         Maximum and minimum daily air temperature at 2 me…
##  3 temperature_2m_min         Maximum and minimum daily air temperature at 2 me…
##  4 apparent_temperature_max   Maximum and minimum daily apparent temperature    
##  5 apparent_temperature_min   Maximum and minimum daily apparent temperature    
##  6 precipitation_sum          Sum of daily precipitation (including rain, showe…
##  7 rain_sum                   Sum of daily rain                                 
##  8 snowfall_sum               Sum of daily snowfall                             
##  9 precipitation_hours        The number of hours with rain                     
## 10 sunrise                    Sun rise and set times                            
## 11 sunset                     Sun rise and set times                            
## 12 windspeed_10m_max          Maximum wind speed and gusts on a day             
## 13 windgusts_10m_max          Maximum wind speed and gusts on a day             
## 14 winddirection_10m_dominant Dominant wind direction                           
## 15 shortwave_radiation_sum    The sum of solar radiaion on a given day in Megaj…
## 16 et0_fao_evapotranspiration Daily sum of ET0 Reference Evapotranspiration of …

A previously existing hourly dataset is loaded, with key analysis variables defined in a vector:

# Load previous data
allCityHourly <- readFromRDS("allCity_20241116")

# Get core training variables
varsTrainHourly <- getVarsTrain(allCityHourly)
varsTrainHourly
##  [1] "hour"                          "temperature_2m"               
##  [3] "relativehumidity_2m"           "dewpoint_2m"                  
##  [5] "apparent_temperature"          "pressure_msl"                 
##  [7] "surface_pressure"              "precipitation"                
##  [9] "rain"                          "snowfall"                     
## [11] "cloudcover"                    "cloudcover_low"               
## [13] "cloudcover_mid"                "cloudcover_high"              
## [15] "shortwave_radiation"           "direct_radiation"             
## [17] "direct_normal_irradiance"      "diffuse_radiation"            
## [19] "windspeed_10m"                 "windspeed_100m"               
## [21] "winddirection_10m"             "winddirection_100m"           
## [23] "windgusts_10m"                 "et0_fao_evapotranspiration"   
## [25] "weathercode"                   "vapor_pressure_deficit"       
## [27] "soil_temperature_0_to_7cm"     "soil_temperature_7_to_28cm"   
## [29] "soil_temperature_28_to_100cm"  "soil_temperature_100_to_255cm"
## [31] "soil_moisture_0_to_7cm"        "soil_moisture_7_to_28cm"      
## [33] "soil_moisture_28_to_100cm"     "soil_moisture_100_to_255cm"   
## [35] "year"                          "doy"
# Assign default label
keyLabel <- genericKeyLabelOM()
keyLabel
## [1] "predictions based on pre-2022 training data applied to 2022 holdout dataset"
# Get years of data available
allCityHourly %>% 
    mutate(year=year(date)) %>%
    group_by(src) %>%
    summarize(n=n(), minYear=min(year), maxYear=max(year), minDate=min(date), maxDate=max(date))
## # A tibble: 7 × 6
##   src          n minYear maxYear minDate    maxDate   
##   <chr>    <int>   <dbl>   <dbl> <date>     <date>    
## 1 Boston  122712    2010    2023 2010-01-01 2023-12-31
## 2 Chicago 122712    2010    2023 2010-01-01 2023-12-31
## 3 Houston 122712    2010    2023 2010-01-01 2023-12-31
## 4 LA      122712    2010    2023 2010-01-01 2023-12-31
## 5 Miami   122712    2010    2023 2010-01-01 2023-12-31
## 6 NYC     117936    2010    2023 2010-01-01 2023-06-15
## 7 Vegas   122712    2010    2023 2010-01-01 2023-12-31

A daily dataset is created from existing data:

omCityMap <- c("bos"="Boston MA", 
               "ord"="Chicago IL", 
               "hou"="Houston TX",
               "lax"="Los Angeles CA", 
               "las"="Las Vegas NV",
               "mia"="Miami FL", 
               "nyc"="New York NY"
               )

allCityDaily <- map_dfr(.x=names(omCityMap), 
                        .f=function(x) omRunAllSteps(dlData=FALSE, runStats=FALSE, abbCity=x, returnDF=TRUE), 
                        .id="src"
                        ) %>%
    mutate(cityName=unname(omCityMap)[as.integer(src)])
## 
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily 
## 
## Rows: 5,113
## Columns: 21
## $ date                       <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time                       <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode                <fct> 71, 73, 73, 3, 3, 3, 1, 71, 1, 0, 3, 3, 2, …
## $ temperature_2m_max         <dbl> 3.4, 0.5, -1.5, 0.0, -0.4, -0.7, 1.6, -2.0,…
## $ temperature_2m_min         <dbl> -2.4, -5.6, -8.8, -5.0, -5.9, -5.3, -4.9, -…
## $ apparent_temperature_max   <dbl> -0.2, -4.1, -8.5, -5.8, -5.2, -6.7, -3.6, -…
## $ apparent_temperature_min   <dbl> -6.6, -13.9, -17.0, -10.9, -11.3, -10.9, -1…
## $ precipitation_sum          <dbl> 0.5, 8.9, 6.7, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0…
## $ rain_sum                   <dbl> 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum               <dbl> 0.49, 5.53, 5.04, 0.00, 0.00, 0.00, 0.00, 0…
## $ precipitation_hours        <dbl> 4, 24, 20, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,…
## $ sunrise                    <dttm> 2010-01-01 07:13:00, 2010-01-02 07:13:00, …
## $ sunset                     <dttm> 2010-01-01 16:22:00, 2010-01-02 16:23:00, …
## $ windspeed_10m_max          <dbl> 13.1, 34.7, 34.8, 28.1, 17.1, 20.7, 21.1, 2…
## $ windgusts_10m_max          <dbl> 23.4, 64.8, 68.0, 50.8, 30.6, 38.9, 38.5, 3…
## $ winddirection_10m_dominant <int> 332, 334, 304, 304, 299, 302, 292, 319, 322…
## $ shortwave_radiation_sum    <dbl> 6.58, 3.86, 4.04, 7.66, 7.55, 8.26, 8.82, 6…
## $ et0_fao_evapotranspiration <dbl> 0.66, 0.65, 0.69, 0.87, 0.76, 0.93, 1.01, 0…
## $ sunrise_chr                <chr> "2010-01-01T07:13", "2010-01-02T07:13", "20…
## $ sunset_chr                 <chr> "2010-01-01T16:22", "2010-01-02T16:23", "20…
## $ fct_winddir                <fct> 332, 334, 304, 304, 299, 302, 292, 319, 322…
## 
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily 
## 
## Rows: 23,376
## Columns: 21
## $ date                       <date> 1960-01-01, 1960-01-02, 1960-01-03, 1960-0…
## $ time                       <chr> "1960-01-01", "1960-01-02", "1960-01-03", "…
## $ weathercode                <fct> 3, 51, 3, 2, 3, 1, 3, 3, 51, 51, 61, 63, 53…
## $ temperature_2m_max         <dbl> 1.6, 4.9, -0.6, -5.9, -6.1, 1.0, 4.9, 1.6, …
## $ temperature_2m_min         <dbl> -3.5, -0.9, -7.5, -11.8, -11.2, -10.3, -3.2…
## $ apparent_temperature_max   <dbl> -4.3, -0.6, -6.9, -13.1, -13.6, -4.4, -1.3,…
## $ apparent_temperature_min   <dbl> -7.9, -7.0, -14.4, -18.4, -17.7, -17.9, -8.…
## $ precipitation_sum          <dbl> 0.0, 1.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6…
## $ rain_sum                   <dbl> 0.0, 1.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6…
## $ snowfall_sum               <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ precipitation_hours        <dbl> 0, 11, 0, 0, 0, 0, 0, 0, 6, 1, 5, 24, 5, 10…
## $ sunrise                    <dttm> 1960-01-01 07:18:00, 1960-01-02 07:18:00, …
## $ sunset                     <dttm> 1960-01-01 16:29:00, 1960-01-02 16:30:00, …
## $ windspeed_10m_max          <dbl> 22.7, 27.6, 27.0, 29.1, 29.6, 32.1, 32.7, 3…
## $ windgusts_10m_max          <dbl> 40.0, 52.6, 44.6, 50.8, 48.2, 52.6, 56.5, 5…
## $ winddirection_10m_dominant <int> 142, 214, 268, 247, 261, 232, 234, 275, 185…
## $ shortwave_radiation_sum    <dbl> 7.45, 2.25, 4.58, 8.66, 9.09, 8.79, 5.86, 8…
## $ et0_fao_evapotranspiration <dbl> 0.95, 0.66, 1.06, 1.06, 1.04, 1.33, 1.23, 1…
## $ sunrise_chr                <chr> "1960-01-01T07:18", "1960-01-02T07:18", "19…
## $ sunset_chr                 <chr> "1960-01-01T16:29", "1960-01-02T16:30", "19…
## $ fct_winddir                <fct> 142, 214, 268, 247, 261, 232, 234, 275, 185…
## 
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily 
## 
## Rows: 5,113
## Columns: 21
## $ date                       <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time                       <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode                <fct> 3, 1, 3, 3, 0, 51, 55, 2, 0, 0, 1, 1, 53, 6…
## $ temperature_2m_max         <dbl> 11.8, 12.0, 10.0, 7.6, 8.0, 12.7, 13.4, 0.8…
## $ temperature_2m_min         <dbl> 3.9, 0.7, 4.4, 1.8, -1.9, -0.1, -0.2, -3.0,…
## $ apparent_temperature_max   <dbl> 8.3, 8.8, 7.2, 4.5, 5.1, 10.9, 12.5, -5.7, …
## $ apparent_temperature_min   <dbl> 1.0, -2.6, 1.2, -2.2, -5.5, -3.6, -6.8, -9.…
## $ precipitation_sum          <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 1.1, 6.4, 0.0, 0.0…
## $ rain_sum                   <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 1.1, 6.4, 0.0, 0.0…
## $ snowfall_sum               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours        <dbl> 0, 0, 0, 0, 0, 4, 10, 0, 0, 0, 0, 0, 5, 17,…
## $ sunrise                    <dttm> 2010-01-01 08:17:00, 2010-01-02 08:17:00, …
## $ sunset                     <dttm> 2010-01-01 18:33:00, 2010-01-02 18:34:00, …
## $ windspeed_10m_max          <dbl> 25.9, 11.2, 14.4, 21.6, 8.8, 17.4, 32.9, 22…
## $ windgusts_10m_max          <dbl> 46.8, 23.4, 28.1, 42.1, 20.2, 33.8, 59.0, 4…
## $ winddirection_10m_dominant <int> 350, 89, 65, 344, 15, 126, 345, 345, 353, 7…
## $ shortwave_radiation_sum    <dbl> 11.87, 13.86, 4.76, 13.82, 14.46, 4.55, 6.9…
## $ et0_fao_evapotranspiration <dbl> 1.69, 1.88, 1.03, 1.74, 1.55, 0.91, 1.20, 1…
## $ sunrise_chr                <chr> "2010-01-01T08:17", "2010-01-02T08:17", "20…
## $ sunset_chr                 <chr> "2010-01-01T18:33", "2010-01-02T18:34", "20…
## $ fct_winddir                <fct> 350, 89, 65, 344, 15, 126, 345, 345, 353, 7…
## 
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily 
## 
## Rows: 5,113
## Columns: 21
## $ date                       <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time                       <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode                <fct> 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 61, 0, …
## $ temperature_2m_max         <dbl> 20.1, 23.2, 23.0, 22.1, 22.9, 23.2, 23.3, 2…
## $ temperature_2m_min         <dbl> 4.7, 6.7, 6.5, 6.5, 5.0, 7.7, 5.2, 8.4, 7.2…
## $ apparent_temperature_max   <dbl> 17.5, 20.6, 19.9, 18.3, 19.8, 19.8, 20.2, 1…
## $ apparent_temperature_min   <dbl> 0.9, 2.9, 2.7, 2.3, 0.8, 4.0, 1.6, 5.0, 3.5…
## $ precipitation_sum          <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ rain_sum                   <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0…
## $ sunrise                    <dttm> 2010-01-01 07:59:00, 2010-01-02 08:00:00, …
## $ sunset                     <dttm> 2010-01-01 17:55:00, 2010-01-02 17:56:00, …
## $ windspeed_10m_max          <dbl> 10.3, 13.0, 12.4, 11.4, 11.5, 9.8, 8.8, 12.…
## $ windgusts_10m_max          <dbl> 23.0, 32.0, 31.3, 25.9, 24.8, 19.8, 18.4, 3…
## $ winddirection_10m_dominant <int> 7, 9, 20, 13, 9, 20, 22, 24, 17, 11, 26, 17…
## $ shortwave_radiation_sum    <dbl> 8.99, 12.29, 11.80, 10.71, 12.54, 12.24, 12…
## $ et0_fao_evapotranspiration <dbl> 1.97, 2.67, 2.95, 2.74, 2.58, 2.60, 2.31, 2…
## $ sunrise_chr                <chr> "2010-01-01T07:59", "2010-01-02T08:00", "20…
## $ sunset_chr                 <chr> "2010-01-01T17:55", "2010-01-02T17:56", "20…
## $ fct_winddir                <fct> 7, 9, 20, 13, 9, 20, 22, 24, 17, 11, 26, 17…
## 
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily 
## 
## Rows: 5,113
## Columns: 21
## $ date                       <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time                       <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode                <fct> 2, 0, 0, 1, 1, 1, 2, 1, 1, 2, 1, 2, 51, 1, …
## $ temperature_2m_max         <dbl> 10.3, 14.2, 14.2, 13.3, 13.6, 15.8, 16.1, 1…
## $ temperature_2m_min         <dbl> -1.3, -0.4, 0.7, 2.8, 0.7, 2.5, 6.0, 1.2, 0…
## $ apparent_temperature_max   <dbl> 6.3, 10.7, 9.7, 9.1, 9.7, 12.0, 12.0, 6.6, …
## $ apparent_temperature_min   <dbl> -4.9, -3.6, -2.8, -0.8, -3.2, -0.9, 3.0, -2…
## $ precipitation_sum          <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ rain_sum                   <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0…
## $ sunrise                    <dttm> 2010-01-01 07:51:00, 2010-01-02 07:52:00, …
## $ sunset                     <dttm> 2010-01-01 17:36:00, 2010-01-02 17:37:00, …
## $ windspeed_10m_max          <dbl> 8.4, 5.4, 9.7, 8.2, 6.4, 6.3, 13.7, 9.7, 5.…
## $ windgusts_10m_max          <dbl> 18.0, 16.6, 24.1, 22.0, 18.7, 16.6, 32.0, 2…
## $ winddirection_10m_dominant <int> 327, 332, 341, 338, 317, 311, 3, 3, 317, 34…
## $ shortwave_radiation_sum    <dbl> 10.98, 11.85, 11.72, 11.04, 11.88, 11.74, 1…
## $ et0_fao_evapotranspiration <dbl> 1.55, 1.62, 1.91, 1.84, 1.78, 1.80, 2.22, 1…
## $ sunrise_chr                <chr> "2010-01-01T07:51", "2010-01-02T07:52", "20…
## $ sunset_chr                 <chr> "2010-01-01T17:36", "2010-01-02T17:37", "20…
## $ fct_winddir                <fct> 327, 332, 341, 338, 317, 311, 3, 3, 317, 34…
## 
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily 
## 
## Rows: 5,113
## Columns: 21
## $ date                       <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time                       <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode                <fct> 53, 1, 51, 3, 3, 1, 1, 2, 61, 3, 2, 3, 1, 3…
## $ temperature_2m_max         <dbl> 26.6, 18.0, 16.7, 15.5, 14.9, 13.8, 16.6, 2…
## $ temperature_2m_min         <dbl> 17.5, 11.6, 11.3, 9.0, 9.4, 6.3, 8.6, 11.6,…
## $ apparent_temperature_max   <dbl> 25.7, 13.8, 13.6, 10.8, 10.8, 9.2, 14.3, 20…
## $ apparent_temperature_min   <dbl> 14.5, 6.9, 6.6, 4.0, 5.0, 0.3, 3.9, 10.2, 1…
## $ precipitation_sum          <dbl> 2.3, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 6.6…
## $ rain_sum                   <dbl> 2.3, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 6.6…
## $ snowfall_sum               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours        <dbl> 6, 0, 4, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, …
## $ sunrise                    <dttm> 2010-01-01 08:07:00, 2010-01-02 08:07:00, …
## $ sunset                     <dttm> 2010-01-01 18:41:00, 2010-01-02 18:42:00, …
## $ windspeed_10m_max          <dbl> 31.3, 29.2, 22.6, 21.0, 23.7, 27.4, 18.6, 1…
## $ windgusts_10m_max          <dbl> 50.4, 46.8, 32.4, 36.7, 41.0, 42.1, 29.5, 2…
## $ winddirection_10m_dominant <int> 263, 350, 343, 324, 321, 329, 345, 247, 334…
## $ shortwave_radiation_sum    <dbl> 11.06, 15.60, 12.41, 16.22, 14.96, 16.69, 1…
## $ et0_fao_evapotranspiration <dbl> 2.80, 3.50, 3.54, 3.76, 2.96, 3.13, 3.27, 2…
## $ sunrise_chr                <chr> "2010-01-01T08:07", "2010-01-02T08:07", "20…
## $ sunset_chr                 <chr> "2010-01-01T18:41", "2010-01-02T18:42", "20…
## $ fct_winddir                <fct> 263, 350, 343, 324, 321, 329, 345, 247, 334…
## 
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily 
## 
## Rows: 4,914
## Columns: 21
## $ date                       <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time                       <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode                <fct> 73, 71, 71, 1, 1, 2, 2, 71, 0, 0, 2, 2, 3, …
## $ temperature_2m_max         <dbl> 5.0, -0.6, -4.8, -0.8, -0.2, 1.1, 3.6, 1.9,…
## $ temperature_2m_min         <dbl> -1.4, -9.2, -10.0, -7.3, -7.3, -5.3, -3.7, …
## $ apparent_temperature_max   <dbl> 2.2, -6.0, -12.5, -6.6, -6.0, -5.0, -1.2, -…
## $ apparent_temperature_min   <dbl> -5.6, -16.9, -17.6, -13.5, -12.4, -10.8, -9…
## $ precipitation_sum          <dbl> 1.8, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7, 0.0…
## $ rain_sum                   <dbl> 0.3, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum               <dbl> 1.05, 0.70, 0.28, 0.00, 0.00, 0.00, 0.00, 0…
## $ precipitation_hours        <dbl> 5, 4, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0…
## $ sunrise                    <dttm> 2010-01-01 08:19:00, 2010-01-02 08:19:00, …
## $ sunset                     <dttm> 2010-01-01 17:39:00, 2010-01-02 17:40:00, …
## $ windspeed_10m_max          <dbl> 11.0, 27.2, 35.3, 20.5, 17.8, 20.2, 18.1, 2…
## $ windgusts_10m_max          <dbl> 18.4, 53.3, 61.2, 43.6, 34.9, 39.2, 32.8, 3…
## $ winddirection_10m_dominant <int> 297, 307, 297, 301, 296, 301, 295, 292, 320…
## $ shortwave_radiation_sum    <dbl> 5.34, 6.81, 5.64, 8.88, 9.33, 8.95, 9.53, 5…
## $ et0_fao_evapotranspiration <dbl> 0.56, 1.06, 1.09, 1.13, 1.02, 1.18, 1.11, 0…
## $ sunrise_chr                <chr> "2010-01-01T08:19", "2010-01-02T08:19", "20…
## $ sunset_chr                 <chr> "2010-01-01T17:39", "2010-01-02T17:40", "20…
## $ fct_winddir                <fct> 297, 307, 297, 301, 296, 301, 295, 292, 320…
allCityDaily %>% 
    mutate(year=year(date)) %>%
    group_by(cityName) %>%
    summarize(n=n(), minYear=min(year), maxYear=max(year), minDate=min(date), maxDate=max(date))
## # A tibble: 7 × 6
##   cityName           n minYear maxYear minDate    maxDate   
##   <chr>          <int>   <dbl>   <dbl> <date>     <date>    
## 1 Boston MA       5113    2010    2023 2010-01-01 2023-12-31
## 2 Chicago IL     23376    1960    2023 1960-01-01 2023-12-31
## 3 Houston TX      5113    2010    2023 2010-01-01 2023-12-31
## 4 Las Vegas NV    5113    2010    2023 2010-01-01 2023-12-31
## 5 Los Angeles CA  5113    2010    2023 2010-01-01 2023-12-31
## 6 Miami FL        5113    2010    2023 2010-01-01 2023-12-31
## 7 New York NY     4914    2010    2023 2010-01-01 2023-06-15

Hourly data is updated with city name to match daily data:

hourlyCityMap <- c("Boston"="Boston MA", 
                   "Chicago"="Chicago IL", 
                   "Houston"="Houston TX",
                   "LA"="Los Angeles CA", 
                   "Vegas"="Las Vegas NV",
                   "Miami"="Miami FL", 
                   "NYC"="New York NY"
                   )

allCityHourly <- allCityHourly %>%
    mutate(cityName=hourlyCityMap[src]) 

allCityHourly %>%
    group_by(src, cityName) %>%
    summarize(n=n(), minDate=min(date), maxDate=max(date), .groups="drop")
## # A tibble: 7 × 5
##   src     cityName            n minDate    maxDate   
##   <chr>   <chr>           <int> <date>     <date>    
## 1 Boston  Boston MA      122712 2010-01-01 2023-12-31
## 2 Chicago Chicago IL     122712 2010-01-01 2023-12-31
## 3 Houston Houston TX     122712 2010-01-01 2023-12-31
## 4 LA      Los Angeles CA 122712 2010-01-01 2023-12-31
## 5 Miami   Miami FL       122712 2010-01-01 2023-12-31
## 6 NYC     New York NY    117936 2010-01-01 2023-06-15
## 7 Vegas   Las Vegas NV   122712 2010-01-01 2023-12-31

Alignment of maximum temperature is explored:

maxTDelta <- allCityHourly %>%
    group_by(cityName, date) %>%
    summarize(maxT=max(temperature_2m), .groups="drop") %>%
    filter(year(date)<=2022) %>%
    left_join(select(allCityDaily, cityName, date, temperature_2m_max), by=c("cityName", "date")) %>%
    mutate(delta=maxT-temperature_2m_max)

maxTDelta %>%
    summary()
##    cityName              date                 maxT        temperature_2m_max
##  Length:33236       Min.   :2010-01-01   Min.   :-22.50   Min.   :-22.80    
##  Class :character   1st Qu.:2013-04-01   1st Qu.: 14.70   1st Qu.: 14.70    
##  Mode  :character   Median :2016-07-01   Median : 23.60   Median : 23.60    
##                     Mean   :2016-07-01   Mean   : 21.34   Mean   : 21.33    
##                     3rd Qu.:2019-10-01   3rd Qu.: 28.80   3rd Qu.: 28.80    
##                     Max.   :2022-12-31   Max.   : 45.90   Max.   : 45.90    
##      delta          
##  Min.   :-2.700000  
##  1st Qu.: 0.000000  
##  Median : 0.000000  
##  Mean   : 0.005277  
##  3rd Qu.: 0.000000  
##  Max.   : 5.900000
maxTDelta %>%
    group_by(cityName) %>%
    summarize(pctDiff=mean(abs(delta)>0.01))
## # A tibble: 7 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA        0    
## 2 Chicago IL       0.104
## 3 Houston TX       0    
## 4 Las Vegas NV     0    
## 5 Los Angeles CA   0    
## 6 Miami FL         0    
## 7 New York NY      0

10% of observations in Chicago differ, potentially due to time zone conversions. All other cities align

The differences in Chicago observations are further explored:

maxTDelta %>%
    filter(cityName=="Chicago IL") %>%
    mutate(mon=month(date), year=year(date), isDiff=abs(delta)>0.01) %>%
    group_by(year, mon) %>%
    summarize(pctDiff=mean(isDiff), n=n(), .groups="drop") %>%
    ggplot(aes(x=factor(month.abb[mon], levels=month.abb), y=pctDiff, group=1)) + 
    geom_line() + 
    geom_hline(data=~summarize(., yint=sum(n*pctDiff)/sum(n)), aes(yintercept=yint), lty=2) +
    facet_wrap(~year) + 
    labs(x=NULL, y="Proportion mismatched", title="Mismatches between daily and hourly maximum temperature")

In general, mismatches seem to be more common in the colder season

A sample of the larger mismatches are explored:

tmpDates <- maxTDelta %>%
    filter(cityName=="Chicago IL") %>%
    mutate(mon=month(date), year=year(date), isDiff=abs(delta)>2.5) %>%
    filter(isDiff) %>%
    pull(date)
tmpDates
## [1] "2013-01-30" "2013-04-17" "2013-12-05" "2016-04-04" "2017-12-05"
## [6] "2018-05-20" "2019-01-20" "2020-12-24" "2022-02-17"
tmpDF <- allCityHourly %>%
    filter(cityName=="Chicago IL") %>%
    select(cityName, time, date, hour, temperature_2m)

map_dfr(.x=tmpDates, 
        .f=function(x) tmpDF %>% filter(date %in% c(x, x-1, x+1)), 
        .id="src"
        ) %>%
    mutate(src=tmpDates[as.integer(src)], isSame=(src==date)) %>%
    left_join(select(maxTDelta, cityName, src=date, temperature_2m_max), by=c("cityName", "src")) %>%
    ggplot(aes(x=time, y=temperature_2m)) +
    geom_line() + 
    geom_line(aes(y=temperature_2m_max), lty=2) +
    facet_wrap(~src, scales="free") + 
    labs(x=NULL, y="Temperature (C)", title="Hourly temperatures near key disconnect dates")

Disconnects appear to arise when the maximum temperature is reached very early and/or very late in the day. This is consistent with time zones potentially driving the differences.

Shading is added for clarity:

map_dfr(.x=tmpDates, 
        .f=function(x) tmpDF %>% filter(date %in% c(x, x-1, x+1)), 
        .id="src"
        ) %>%
    mutate(src=tmpDates[as.integer(src)], isSame=(src==date)) %>%
    left_join(select(maxTDelta, cityName, src=date, temperature_2m_max), by=c("cityName", "src")) %>%
    ggplot(aes(x=time, y=temperature_2m)) +
    geom_line() + 
    geom_line(aes(y=temperature_2m_max), lty=2) +
    facet_wrap(~src, scales="free") + 
    labs(x=NULL, 
         y="Temperature (C)", 
         title="Hourly temperatures near key disconnect dates", 
         subtitle="Line is hourly data, shaded region is the 24 hours in the key date", 
         caption="Dotted line is maximum temperature reported in daily data"
         ) +
    geom_tile(data=~filter(., isSame), fill="lightblue", aes(x=time, y=0, height=+Inf), alpha=0.5)

Chicago data are shifted by an hour to check whether time zones fully explain the differences:

# Run previously
# tmpDF <- allCityHourly %>%
#     filter(cityName=="Chicago IL") %>%
#     select(cityName, time, date, hour, temperature_2m)

tmpDF %>% 
    mutate(newtime=time-hours(1), date=date(newtime)) %>% 
    filter(!(date %in% c(min(date)))) %>% 
    group_by(cityName, date) %>% 
    summarize(maxT=max(temperature_2m), .groups="drop") %>% 
    filter(year(date)<=2022) %>% 
    left_join(select(allCityDaily, cityName, date, temperature_2m_max), by=c("cityName", "date")) %>%
    mutate(delta=maxT-temperature_2m_max) %>% 
    summary()
##    cityName              date                 maxT        temperature_2m_max
##  Length:4748        Min.   :2010-01-01   Min.   :-22.80   Min.   :-22.80    
##  Class :character   1st Qu.:2013-04-01   1st Qu.:  4.90   1st Qu.:  4.90    
##  Mode  :character   Median :2016-07-01   Median : 14.90   Median : 14.90    
##                     Mean   :2016-07-01   Mean   : 14.28   Mean   : 14.28    
##                     3rd Qu.:2019-10-01   3rd Qu.: 24.20   3rd Qu.: 24.20    
##                     Max.   :2022-12-31   Max.   : 37.40   Max.   : 37.40    
##      delta  
##  Min.   :0  
##  1st Qu.:0  
##  Median :0  
##  Mean   :0  
##  3rd Qu.:0  
##  Max.   :0

If date is recalculated based on time minus one hour, the Chicago hourly data aligns with the Chicago daily data for maximum temperature

Alignment of precipitation is explored, first with time zones in hourly data “as is”:

sumPDelta <- allCityHourly %>%
    group_by(cityName, date) %>%
    summarize(sumP=sum(precipitation), .groups="drop") %>%
    filter(year(date)<=2022) %>%
    left_join(select(allCityDaily, cityName, date, precipitation_sum), by=c("cityName", "date")) %>%
    mutate(delta=sumP-precipitation_sum)

sumPDelta %>%
    summary()
##    cityName              date                 sumP         precipitation_sum
##  Length:33236       Min.   :2010-01-01   Min.   :  0.000   Min.   :  0.000  
##  Class :character   1st Qu.:2013-04-01   1st Qu.:  0.000   1st Qu.:  0.000  
##  Mode  :character   Median :2016-07-01   Median :  0.000   Median :  0.000  
##                     Mean   :2016-07-01   Mean   :  2.575   Mean   :  2.575  
##                     3rd Qu.:2019-10-01   3rd Qu.:  1.500   3rd Qu.:  1.600  
##                     Max.   :2022-12-31   Max.   :143.900   Max.   :143.900  
##      delta      
##  Min.   :-14.7  
##  1st Qu.:  0.0  
##  Median :  0.0  
##  Mean   :  0.0  
##  3rd Qu.:  0.0  
##  Max.   : 14.7
sumPDelta %>%
    group_by(cityName) %>%
    summarize(pctDiff=mean(abs(delta)>0.005))
## # A tibble: 7 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA        0    
## 2 Chicago IL       0.259
## 3 Houston TX       0    
## 4 Las Vegas NV     0    
## 5 Los Angeles CA   0    
## 6 Miami FL         0    
## 7 New York NY      0

25% of observations in Chicago differ, likely due to time zone conversions. All other cities align

Chicago data is updated to be one hour earlier:

sumPDelta <- allCityHourly %>%
    mutate(date=as.Date(ifelse(cityName %in% c("Chicago IL"), time-hours(1), date))) %>%
    group_by(cityName, date) %>%
    summarize(sumP=sum(precipitation), .groups="drop") %>%
    filter(year(date)<=2022) %>%
    left_join(select(allCityDaily, cityName, date, precipitation_sum), by=c("cityName", "date")) %>%
    mutate(delta=sumP-precipitation_sum)

sumPDelta %>%
    summary()
##    cityName              date                 sumP         precipitation_sum
##  Length:28488       Min.   :2010-01-01   Min.   :  0.000   Min.   :  0.000  
##  Class :character   1st Qu.:2013-04-01   1st Qu.:  0.000   1st Qu.:  0.000  
##  Mode  :character   Median :2016-07-01   Median :  0.000   Median :  0.000  
##                     Mean   :2016-07-01   Mean   :  2.499   Mean   :  2.499  
##                     3rd Qu.:2019-10-01   3rd Qu.:  1.400   3rd Qu.:  1.400  
##                     Max.   :2022-12-31   Max.   :143.900   Max.   :143.900  
##      delta           
##  Min.   :-1.421e-14  
##  1st Qu.: 0.000e+00  
##  Median : 0.000e+00  
##  Mean   : 9.273e-18  
##  3rd Qu.: 0.000e+00  
##  Max.   : 7.105e-15
sumPDelta %>%
    group_by(cityName) %>%
    summarize(pctDiff=mean(abs(delta)>0.005))
## # A tibble: 6 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA            0
## 2 Houston TX           0
## 3 Las Vegas NV         0
## 4 Los Angeles CA       0
## 5 Miami FL             0
## 6 New York NY          0

After adjusting for time zone in Chicago, all observations align

The process is generalized in a function that adjusts only Chicago and with other significant hard-coding:

tmpDailyVsHourly <- function(df1=allCityHourly, 
                             df2=allCityDaily, 
                             chgCities=c("Chicago IL"),
                             chgHours=-1,
                             varDF1="precipitation",
                             varDF2="precipitation_sum",
                             fn=sum,
                             eqTol=0.005
                             ) {

    df <- df1 %>%
        mutate(date=as.Date(ifelse(cityName %in% chgCities, time+hours(chgHours), date))) %>%
        group_by(cityName, date) %>%
        summarize(across(.cols=all_of(varDF1), .fns=fn, .names="agg"), .groups="drop") %>%
        filter(year(date)<=2022) %>%
        left_join(select(df2, all_of(c("cityName", "date", varDF2))), 
                         by=c("cityName", "date")
                  ) %>%
        mutate(delta=agg-get(varDF2))

    df %>%
        summary() %>%
        print()

    df %>%
        group_by(cityName) %>%
        summarize(pctDiff=mean(abs(delta)>eqTol)) %>%
        print()
}

tmpDailyVsHourly(varDF1="precipitation", varDF2="precipitation_sum", fn=sum, eqTol=0.005)
##    cityName              date                 agg          precipitation_sum
##  Length:28488       Min.   :2010-01-01   Min.   :  0.000   Min.   :  0.000  
##  Class :character   1st Qu.:2013-04-01   1st Qu.:  0.000   1st Qu.:  0.000  
##  Mode  :character   Median :2016-07-01   Median :  0.000   Median :  0.000  
##                     Mean   :2016-07-01   Mean   :  2.499   Mean   :  2.499  
##                     3rd Qu.:2019-10-01   3rd Qu.:  1.400   3rd Qu.:  1.400  
##                     Max.   :2022-12-31   Max.   :143.900   Max.   :143.900  
##      delta           
##  Min.   :-1.421e-14  
##  1st Qu.: 0.000e+00  
##  Median : 0.000e+00  
##  Mean   : 9.273e-18  
##  3rd Qu.: 0.000e+00  
##  Max.   : 7.105e-15  
## # A tibble: 6 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA            0
## 2 Houston TX           0
## 3 Las Vegas NV         0
## 4 Los Angeles CA       0
## 5 Miami FL             0
## 6 New York NY          0

The function is tested for maximum temperature:

# No change in time
tmpDailyVsHourly(varDF1="temperature_2m", varDF2="temperature_2m_max", fn=max, eqTol=0.005, chgCities=c())
##    cityName              date                 agg         temperature_2m_max
##  Length:33236       Min.   :2010-01-01   Min.   :-22.50   Min.   :-22.80    
##  Class :character   1st Qu.:2013-04-01   1st Qu.: 14.70   1st Qu.: 14.70    
##  Mode  :character   Median :2016-07-01   Median : 23.60   Median : 23.60    
##                     Mean   :2016-07-01   Mean   : 21.34   Mean   : 21.33    
##                     3rd Qu.:2019-10-01   3rd Qu.: 28.80   3rd Qu.: 28.80    
##                     Max.   :2022-12-31   Max.   : 45.90   Max.   : 45.90    
##      delta          
##  Min.   :-2.700000  
##  1st Qu.: 0.000000  
##  Median : 0.000000  
##  Mean   : 0.005277  
##  3rd Qu.: 0.000000  
##  Max.   : 5.900000  
## # A tibble: 7 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA        0    
## 2 Chicago IL       0.104
## 3 Houston TX       0    
## 4 Las Vegas NV     0    
## 5 Los Angeles CA   0    
## 6 Miami FL         0    
## 7 New York NY      0
# Chicago shifted by -1 hours (default)
tmpDailyVsHourly(varDF1="temperature_2m", varDF2="temperature_2m_max", fn=max, eqTol=0.005)
##    cityName              date                 agg         temperature_2m_max
##  Length:28488       Min.   :2010-01-01   Min.   :-15.40   Min.   :-15.40    
##  Class :character   1st Qu.:2013-04-01   1st Qu.: 16.50   1st Qu.: 16.50    
##  Mode  :character   Median :2016-07-01   Median : 24.60   Median : 24.60    
##                     Mean   :2016-07-01   Mean   : 22.51   Mean   : 22.51    
##                     3rd Qu.:2019-10-01   3rd Qu.: 29.30   3rd Qu.: 29.30    
##                     Max.   :2022-12-31   Max.   : 45.90   Max.   : 45.90    
##      delta  
##  Min.   :0  
##  1st Qu.:0  
##  Median :0  
##  Mean   :0  
##  3rd Qu.:0  
##  Max.   :0  
## # A tibble: 6 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA            0
## 2 Houston TX           0
## 3 Las Vegas NV         0
## 4 Los Angeles CA       0
## 5 Miami FL             0
## 6 New York NY          0

The function is tested for maximum and minimum apparent temperature:

# Maximum apparent temperature, with Chicago time shift
tmpDailyVsHourly(varDF1="apparent_temperature", varDF2="apparent_temperature_max", fn=max, eqTol=0.005)
##    cityName              date                 agg        
##  Length:28488       Min.   :2010-01-01   Min.   :-23.60  
##  Class :character   1st Qu.:2013-04-01   1st Qu.: 13.90  
##  Mode  :character   Median :2016-07-01   Median : 24.30  
##                     Mean   :2016-07-01   Mean   : 22.04  
##                     3rd Qu.:2019-10-01   3rd Qu.: 31.40  
##                     Max.   :2022-12-31   Max.   : 45.60  
##  apparent_temperature_max     delta  
##  Min.   :-23.60           Min.   :0  
##  1st Qu.: 13.90           1st Qu.:0  
##  Median : 24.30           Median :0  
##  Mean   : 22.04           Mean   :0  
##  3rd Qu.: 31.40           3rd Qu.:0  
##  Max.   : 45.60           Max.   :0  
## # A tibble: 6 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA            0
## 2 Houston TX           0
## 3 Las Vegas NV         0
## 4 Los Angeles CA       0
## 5 Miami FL             0
## 6 New York NY          0
# Minimum apparent temperature, with Chicago time shift
tmpDailyVsHourly(varDF1="apparent_temperature", varDF2="apparent_temperature_min", fn=min, eqTol=0.005)
##    cityName              date                 agg        
##  Length:28488       Min.   :2010-01-01   Min.   :-33.10  
##  Class :character   1st Qu.:2013-04-01   1st Qu.:  3.30  
##  Mode  :character   Median :2016-07-01   Median : 12.90  
##                     Mean   :2016-07-01   Mean   : 12.12  
##                     3rd Qu.:2019-10-01   3rd Qu.: 22.00  
##                     Max.   :2022-12-31   Max.   : 33.20  
##  apparent_temperature_min     delta  
##  Min.   :-33.10           Min.   :0  
##  1st Qu.:  3.30           1st Qu.:0  
##  Median : 12.90           Median :0  
##  Mean   : 12.12           Mean   :0  
##  3rd Qu.: 22.00           3rd Qu.:0  
##  Max.   : 33.20           Max.   :0  
## # A tibble: 6 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA            0
## 2 Houston TX           0
## 3 Las Vegas NV         0
## 4 Los Angeles CA       0
## 5 Miami FL             0
## 6 New York NY          0

The function is tested for sum of precipitation and snowfall:

# Precipitation, with Chicago time shift
tmpDailyVsHourly(varDF1="precipitation", varDF2="precipitation_sum", fn=sum, eqTol=0.005)
##    cityName              date                 agg          precipitation_sum
##  Length:28488       Min.   :2010-01-01   Min.   :  0.000   Min.   :  0.000  
##  Class :character   1st Qu.:2013-04-01   1st Qu.:  0.000   1st Qu.:  0.000  
##  Mode  :character   Median :2016-07-01   Median :  0.000   Median :  0.000  
##                     Mean   :2016-07-01   Mean   :  2.499   Mean   :  2.499  
##                     3rd Qu.:2019-10-01   3rd Qu.:  1.400   3rd Qu.:  1.400  
##                     Max.   :2022-12-31   Max.   :143.900   Max.   :143.900  
##      delta           
##  Min.   :-1.421e-14  
##  1st Qu.: 0.000e+00  
##  Median : 0.000e+00  
##  Mean   : 9.273e-18  
##  3rd Qu.: 0.000e+00  
##  Max.   : 7.105e-15  
## # A tibble: 6 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA            0
## 2 Houston TX           0
## 3 Las Vegas NV         0
## 4 Los Angeles CA       0
## 5 Miami FL             0
## 6 New York NY          0
# Snowfall, with Chicago time shift
tmpDailyVsHourly(varDF1="snowfall", varDF2="snowfall_sum", fn=sum, eqTol=0.005)
##    cityName              date                 agg            snowfall_sum     
##  Length:28488       Min.   :2010-01-01   Min.   : 0.00000   Min.   : 0.00000  
##  Class :character   1st Qu.:2013-04-01   1st Qu.: 0.00000   1st Qu.: 0.00000  
##  Mode  :character   Median :2016-07-01   Median : 0.00000   Median : 0.00000  
##                     Mean   :2016-07-01   Mean   : 0.08226   Mean   : 0.08226  
##                     3rd Qu.:2019-10-01   3rd Qu.: 0.00000   3rd Qu.: 0.00000  
##                     Max.   :2022-12-31   Max.   :32.83000   Max.   :32.83000  
##      delta           
##  Min.   :-3.553e-15  
##  1st Qu.: 0.000e+00  
##  Median : 0.000e+00  
##  Mean   : 1.271e-18  
##  3rd Qu.: 0.000e+00  
##  Max.   : 3.553e-15  
## # A tibble: 6 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA            0
## 2 Houston TX           0
## 3 Las Vegas NV         0
## 4 Los Angeles CA       0
## 5 Miami FL             0
## 6 New York NY          0

The function is tested for hours of precipitation:

# Precipitation hours, with Chicago time shift
tmpDailyVsHourly(varDF1="precipitation", varDF2="precipitation_hours", fn=function(x) sum(x>0), eqTol=0.005)
##    cityName              date                 agg         precipitation_hours
##  Length:28488       Min.   :2010-01-01   Min.   : 0.000   Min.   : 0.000     
##  Class :character   1st Qu.:2013-04-01   1st Qu.: 0.000   1st Qu.: 0.000     
##  Mode  :character   Median :2016-07-01   Median : 0.000   Median : 0.000     
##                     Mean   :2016-07-01   Mean   : 2.992   Mean   : 2.992     
##                     3rd Qu.:2019-10-01   3rd Qu.: 4.000   3rd Qu.: 4.000     
##                     Max.   :2022-12-31   Max.   :24.000   Max.   :24.000     
##      delta  
##  Min.   :0  
##  1st Qu.:0  
##  Median :0  
##  Mean   :0  
##  3rd Qu.:0  
##  Max.   :0  
## # A tibble: 6 × 2
##   cityName       pctDiff
##   <chr>            <dbl>
## 1 Boston MA            0
## 2 Houston TX           0
## 3 Las Vegas NV         0
## 4 Los Angeles CA       0
## 5 Miami FL             0
## 6 New York NY          0

A random forest model is created to estimate dewpoint based on temperature and apparent temperature given city and month:

# Use only 10% as training data
set.seed(25091014)
idxTrain_v1 <- sample(1:nrow(allCityHourly), round(0.1*nrow(allCityHourly)), replace=FALSE)

rfTestDewP <- runFullRF(dfTrain=allCityHourly[idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ) %>%
                            select(dewpoint_2m, temperature_2m, apparent_temperature, fct_city, fct_mon),
                        dfTest=allCityHourly[-idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ),
                        yVar="dewpoint_2m", 
                        xVars=c("temperature_2m", "apparent_temperature", "fct_city", "fct_mon"),
                        isContVar=TRUE,
                        rndTo=-1L,
                        returnData=TRUE
                        )

## 
## R-squared of test data is: 93.31% (RMSE 2.84 vs. 10.98 null)
## `geom_smooth()` using formula = 'y ~ x'

rfTestDewP$tstPred %>%
    group_by(fct_city, fct_mon) %>%
    summarize(n=n(), 
              tssActual=sum((dewpoint_2m-mean(dewpoint_2m))**2), 
              rssPred=sum((dewpoint_2m-pred)**2), 
              r2=1-rssPred/tssActual, 
              rmse=sqrt(rssPred/n), 
              .groups="drop"
              ) %>%
    ggplot(aes(x=fct_city, y=fct_mon)) + 
    geom_tile(aes(fill=r2)) + 
    geom_text(aes(label=round(r2, 3)), size=3) +
    scale_fill_continuous("R2", low="white", high="lightgreen", limits=c(0, 1)) + 
    labs(x=NULL, y=NULL, title="R2 for dewpoint predictions")

Actual daily dewpoints are calculated using hourly data:

dewPDaily <- allCityHourly %>% 
    mutate(fct_city=factor(cityName), 
           fct_mon=factor(month.abb[month(date)], levels=month.abb)
           ) %>%
    group_by(fct_city, fct_mon, date) %>%
    summarize(across(dewpoint_2m, .fns=list(mean=mean, min=min, max=max)), .groups="drop")
dewPDaily
## # A tibble: 35,592 × 6
##    fct_city  fct_mon date       dewpoint_2m_mean dewpoint_2m_min dewpoint_2m_max
##    <fct>     <fct>   <date>                <dbl>           <dbl>           <dbl>
##  1 Boston MA Jan     2010-01-01            -3.48            -5.1            -1.5
##  2 Boston MA Jan     2010-01-02            -4.96           -10              -0.6
##  3 Boston MA Jan     2010-01-03            -9.23           -12.8            -4.7
##  4 Boston MA Jan     2010-01-04            -6.86            -8.6            -4.7
##  5 Boston MA Jan     2010-01-05            -8.16            -9.4            -6.5
##  6 Boston MA Jan     2010-01-06            -8.53            -9.3            -7.4
##  7 Boston MA Jan     2010-01-07            -7.86            -9.2            -5.5
##  8 Boston MA Jan     2010-01-08            -8.65           -10.9            -7.5
##  9 Boston MA Jan     2010-01-09           -13.5            -17             -11.1
## 10 Boston MA Jan     2010-01-10           -16.4            -18             -14.4
## # ℹ 35,582 more rows

Annual average dewpoints by city are explored:

dewPDaily %>%
    mutate(year=year(date)) %>%
    filter(year>=2010, year<2023) %>%
    group_by(year, fct_city) %>%
    summarize(mudp=mean(dewpoint_2m_mean), .groups="drop") %>%
    ggplot(aes(x=factor(year), y=mudp)) + 
    geom_line(aes(group=fct_city, color=fct_city), lwd=1) + 
    labs(title="Average dewpoint by city and year", y="Average dewpoint (C)", x=NULL) + 
    scale_color_discrete(NULL)

Monthly average dewpoints by city are explored:

dewPDaily %>%
    mutate(year=year(date)) %>%
    filter(year>=2010, year<2023) %>%
    group_by(fct_city, fct_mon) %>%
    summarize(mudp=mean(dewpoint_2m_mean), 
              maxdp=max(dewpoint_2m_mean),
              mindp=min(dewpoint_2m_mean),
              .groups="drop"
              ) %>%
    ggplot(aes(x=fct_mon)) + 
    geom_line(aes(y=mudp, group=fct_city, color=fct_city), lwd=1) + 
    labs(title="Average dewpoint by city and month", y="Average dewpoint (C)", x=NULL) + 
    scale_color_discrete(NULL)

Rolling 21-day average dewpoints by city are explored:

dewPDaily %>%
    mutate(year=year(date), doy=pmin(365, yday(date))) %>%
    arrange(fct_city, date) %>%
    group_by(fct_city) %>%
    helperRollingAgg(origVar="dewpoint_2m_mean", newName="dewpoint_r21", k=21) %>%
    ungroup() %>%
    filter(year>=2010, year<2023, !is.na(dewpoint_r21)) %>%
    group_by(fct_city, doy) %>%
    summarize(mudp=mean(dewpoint_r21), 
              maxdp=max(dewpoint_r21),
              mindp=min(dewpoint_r21),
              .groups="drop"
              ) %>%
    ggplot(aes(x=doy)) + 
    geom_line(aes(y=mudp, group=fct_city, color=fct_city), lwd=1) + 
    labs(title="Average dewpoint by city and day of year", 
         subtitle="Rolling 21-day",
         y="Average rolling 21-day dewpoint (C)", 
         x="Day of year"
         ) + 
    scale_color_discrete(NULL)

Dew points are added to the daily data:

allCityDaily_v2 <- 
    allCityDaily %>%
    left_join(dewPDaily %>%
                  mutate(cityName=as.character(fct_city)) %>%
                  select(cityName, date, starts_with("dewp")), 
              by=c("cityName", "date")
              )

allCityDaily_v2 %>%
    group_by(cityName) %>%
    summarize(n=n(), hasDew=sum(!is.na(dewpoint_2m_mean)))
## # A tibble: 7 × 3
##   cityName           n hasDew
##   <chr>          <int>  <int>
## 1 Boston MA       5113   5113
## 2 Chicago IL     23376   5113
## 3 Houston TX      5113   5113
## 4 Las Vegas NV    5113   5113
## 5 Los Angeles CA  5113   5113
## 6 Miami FL        5113   5113
## 7 New York NY     4914   4914

Relationships among temperature, dewpoint, and apparent temperature are explored:

allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    mutate(across(where(is.numeric), .fns=round)) %>%
    count(cityName, muTemp, muApp) %>%
    ggplot(aes(x=muTemp, y=muApp)) + 
    geom_point(aes(size=n), alpha=0.5)+ 
    facet_wrap(~cityName) + 
    labs(title="July apparent temperature vs. temperature", 
         x="Average daily temperature (C)", 
         y="Average daily\napparent temperature (C)"
         ) + 
    scale_size_continuous(NULL)

allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    mutate(across(where(is.numeric), .fns=round)) %>%
    count(cityName, muDew, muApp) %>%
    ggplot(aes(x=muDew, y=muApp)) + 
    geom_point(aes(size=n), alpha=0.5)+ 
    facet_wrap(~cityName) + 
    labs(title="July apparent temperature vs. dewpoint", 
         x="Average daily dewpoint (C)", 
         y="Average daily\napparent temperature (C)"
         ) + 
    scale_size_continuous(NULL)

allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    mutate(across(where(is.numeric), .fns=round)) %>%
    count(cityName, muDew, muTemp) %>%
    ggplot(aes(x=muDew, y=muTemp)) + 
    geom_point(aes(size=n), alpha=0.5)+ 
    facet_wrap(~cityName) + 
    labs(title="July temperature vs. dewpoint", 
         x="Average daily dewpoint (C)", 
         y="Average daily\ntemperature (C)"
         ) + 
    scale_size_continuous(NULL)

Relationships among temperature, dewpoint, and apparent temperature are further explored:

allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    mutate(across(where(is.numeric), .fns=round), 
           appr5=autoRound(muApp, rndTo=5L)
           ) %>%
    count(cityName, muDew, muTemp, appr5) %>%
    ggplot(aes(x=muDew, y=muTemp)) + 
    geom_point(aes(size=n, color=factor(appr5)), alpha=0.5)+ 
    facet_wrap(~cityName) + 
    labs(title="July temperature vs. dewpoint", 
         x="Average daily dewpoint (C)", 
         y="Average daily\ntemperature (C)"
         ) + 
    scale_size_continuous(NULL) + 
    scale_color_discrete("App Temp\n(nearest 5 C)")

A simple linear regression is explored:

allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    lm(muApp ~ muTemp, data=.) %>%
    summary()
## 
## Call:
## lm(formula = muApp ~ muTemp, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1693 -1.3825 -0.1217  1.9902  4.6627 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.40918    0.29867   4.718  2.5e-06 ***
## muTemp       1.02750    0.01116  92.101  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.362 on 2819 degrees of freedom
## Multiple R-squared:  0.7506, Adjusted R-squared:  0.7505 
## F-statistic:  8483 on 1 and 2819 DF,  p-value: < 2.2e-16
allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    lm(muApp ~ muDew, data=.) %>%
    summary()
## 
## Call:
## lm(formula = muApp ~ muDew, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.865  -3.368   0.385   3.460  12.594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.10439    0.23536  102.42   <2e-16 ***
## muDew        0.26039    0.01273   20.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.413 on 2819 degrees of freedom
## Multiple R-squared:  0.1293, Adjusted R-squared:  0.129 
## F-statistic: 418.7 on 1 and 2819 DF,  p-value: < 2.2e-16

On average, temperature is better than dewpoint as a predictor of apparent temperature. Interaction is also explored:

allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    lm(muApp ~ muTemp + muDew, data=.) %>%
    summary()
## 
## Call:
## lm(formula = muApp ~ muTemp + muDew, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5790 -0.5599  0.0916  0.6567  3.2703 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.110891   0.126564  -48.28   <2e-16 ***
## muTemp       1.091163   0.004236  257.58   <2e-16 ***
## muDew        0.337196   0.002586  130.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.891 on 2818 degrees of freedom
## Multiple R-squared:  0.9645, Adjusted R-squared:  0.9645 
## F-statistic: 3.831e+04 on 2 and 2818 DF,  p-value: < 2.2e-16
allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    lm(muApp ~ muTemp + muDew + muTemp:muDew, data=.) %>%
    summary()
## 
## Call:
## lm(formula = muApp ~ muTemp + muDew + muTemp:muDew, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5912 -0.5465  0.0883  0.6354  3.2415 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.0232152  0.3884503 -10.357  < 2e-16 ***
## muTemp        1.0234331  0.0126449  80.936  < 2e-16 ***
## muDew         0.1842179  0.0270512   6.810 1.19e-11 ***
## muTemp:muDew  0.0051692  0.0009099   5.681 1.48e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8861 on 2817 degrees of freedom
## Multiple R-squared:  0.9649, Adjusted R-squared:  0.9649 
## F-statistic: 2.583e+04 on 3 and 2817 DF,  p-value: < 2.2e-16

The inclusion of both temperature and dewpoint meaningfully decreases residual standard error. The interaction term between temperature and dewpoint, while statistically significant, has little practical impact. City is added as an additional explanatory variable:

allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    lm(muApp ~ muTemp + muDew + cityName + 0, data=.) %>%
    summary()
## 
## Call:
## lm(formula = muApp ~ muTemp + muDew + cityName + 0, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.90625 -0.42940  0.06619  0.46037  2.76072 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## muTemp                  1.035693   0.005850  177.03   <2e-16 ***
## muDew                   0.301535   0.004165   72.39   <2e-16 ***
## cityNameBoston MA      -4.705775   0.145150  -32.42   <2e-16 ***
## cityNameChicago IL     -4.732561   0.146445  -32.32   <2e-16 ***
## cityNameHouston TX     -3.211754   0.177352  -18.11   <2e-16 ***
## cityNameLas Vegas NV   -4.206984   0.188979  -22.26   <2e-16 ***
## cityNameLos Angeles CA -3.749403   0.140676  -26.65   <2e-16 ***
## cityNameMiami FL       -3.267359   0.174773  -18.70   <2e-16 ***
## cityNameNew York NY    -4.303946   0.154345  -27.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7383 on 2812 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 4.832e+05 on 9 and 2812 DF,  p-value: < 2.2e-16
allCityDaily_v2 %>%
    mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2, 
           muApp=(apparent_temperature_max + apparent_temperature_min)/2,
           muDew=dewpoint_2m_mean
           ) %>%
    filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
    select(cityName, muTemp, muDew, muApp) %>%
    lm(muApp ~ muTemp:cityName + muDew:cityName + cityName + 0, data=.) %>%
    summary()
## 
## Call:
## lm(formula = muApp ~ muTemp:cityName + muDew:cityName + cityName + 
##     0, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.08814 -0.39255  0.06184  0.42885  1.90562 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## cityNameBoston MA              -4.607592   0.285746 -16.125  < 2e-16 ***
## cityNameChicago IL             -6.294447   0.250438 -25.134  < 2e-16 ***
## cityNameHouston TX             -6.352102   0.886444  -7.166 9.84e-13 ***
## cityNameLas Vegas NV           -0.360121   0.447277  -0.805    0.421    
## cityNameLos Angeles CA         -2.990764   0.372639  -8.026 1.47e-15 ***
## cityNameMiami FL              -16.050511   1.563984 -10.263  < 2e-16 ***
## cityNameNew York NY            -6.136167   0.349298 -17.567  < 2e-16 ***
## muTemp:cityNameBoston MA        0.984007   0.013408  73.392  < 2e-16 ***
## muTemp:cityNameChicago IL       1.063621   0.016338  65.100  < 2e-16 ***
## muTemp:cityNameHouston TX       1.008152   0.022276  45.258  < 2e-16 ***
## muTemp:cityNameLas Vegas NV     0.929141   0.013618  68.228  < 2e-16 ***
## muTemp:cityNameLos Angeles CA   0.993172   0.012892  77.041  < 2e-16 ***
## muTemp:cityNameMiami FL         1.119919   0.036174  30.960  < 2e-16 ***
## muTemp:cityNameNew York NY      1.043039   0.015807  65.984  < 2e-16 ***
## cityNameBoston MA:muDew         0.365493   0.012838  28.470  < 2e-16 ***
## cityNameChicago IL:muDew        0.350936   0.016710  21.001  < 2e-16 ***
## cityNameHouston TX:muDew        0.469955   0.023034  20.403  < 2e-16 ***
## cityNameLas Vegas NV:muDew      0.239456   0.005067  47.254  < 2e-16 ***
## cityNameLos Angeles CA:muDew    0.319501   0.014467  22.085  < 2e-16 ***
## cityNameMiami FL:muDew          0.741580   0.050220  14.767  < 2e-16 ***
## cityNameNew York NY:muDew       0.389827   0.012843  30.352  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6826 on 2800 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 2.423e+05 on 21 and 2800 DF,  p-value: < 2.2e-16

Including city reduces residual error slightly

Dewpoints by month are further explored:

dewPDaily %>%
    mutate(gte68=(dewpoint_2m_mean>=20), 
           lte32=(dewpoint_2m_mean<=0)
           ) %>%
    group_by(fct_city, fct_mon) %>%
    summarize(gte68=mean(gte68), lte32=mean(lte32), .groups="drop") %>%
    pivot_longer(cols=-c(fct_city, fct_mon)) %>%
    ggplot(aes(x=fct_mon, y=value)) + 
    geom_line(aes(group=name, color=c("gte68"=">=68F", "lte32"="<=32F")[name])) + 
    facet_wrap(~fct_city) + 
    labs(x=NULL, y="Proportion of days", title="Proportion of days by month by dewpoint range") + 
    scale_color_discrete("Dewpoint")

Dewpoints at or above 20C (68F) by year are explored:

dewPDaily %>%
    mutate(gte68=(dewpoint_2m_mean>=20), 
           lte32=(dewpoint_2m_mean<=0), 
           year=year(date), 
           doy=yday(date)
           ) %>%
    filter(year < 2023) %>%
    arrange(fct_city, date) %>%
    group_by(fct_city, year) %>%
    mutate(gte68=cumsum(gte68), lte32=cumsum(lte32)) %>%
    ungroup() %>%
    ggplot(aes(x=doy, y=gte68)) + 
    geom_line(aes(group=factor(year), color=factor(year))) + 
    facet_wrap(~fct_city, scales="free_y") + 
    labs(x=NULL, 
         y="Days with dewpoint >= 20C (68F)", 
         title="Cumulative days by year with dewpoint >= 20C (68F)"
         ) + 
    scale_color_discrete(NULL)

Dewpoints at or below 0C (32F) by year are explored:

dewPDaily %>%
    mutate(gte68=(dewpoint_2m_mean>=20), 
           lte32=(dewpoint_2m_mean<=0), 
           year=year(date), 
           doy=yday(date)
           ) %>%
    filter(year < 2023) %>%
    arrange(fct_city, date) %>%
    group_by(fct_city, year) %>%
    mutate(gte68=cumsum(gte68), lte32=cumsum(lte32)) %>%
    ungroup() %>%
    ggplot(aes(x=doy, y=lte32)) + 
    geom_line(aes(group=factor(year), color=factor(year))) + 
    facet_wrap(~fct_city, scales="free_y") + 
    labs(x=NULL, 
         y="Days with dewpoint <= 0C (32F)", 
         title="Cumulative days by year with dewpoint <= 0C (32F)"
         ) + 
    scale_color_discrete(NULL)

Dewpoints at or above 20C (68F) by month are explored:

dewPDaily %>%
    mutate(gte68=(dewpoint_2m_mean>=20), 
           lte32=(dewpoint_2m_mean<=0), 
           year=year(date), 
           doy=yday(date)
           ) %>%
    filter(year < 2023) %>%
    group_by(fct_city, fct_mon) %>%
    summarize(gte68=sum(gte68), lte32=sum(lte32), .groups="drop") %>%
    ggplot(aes(x=fct_mon, y=gte68)) + 
    geom_col(fill="lightblue") +
    geom_text(aes(label=gte68), size=2.5, vjust=0) +
    facet_wrap(~fct_city) + 
    labs(x=NULL, 
         y="# Days (2010-2022)", 
         title="Days with dewpoint >= 20C (68F)\n2010-2022"
         ) + 
    scale_color_discrete(NULL)

Dewpoints at or below 0C (32F) by month are explored:

dewPDaily %>%
    mutate(gte68=(dewpoint_2m_mean>=20), 
           lte32=(dewpoint_2m_mean<=0), 
           year=year(date), 
           doy=yday(date)
           ) %>%
    filter(year < 2023) %>%
    group_by(fct_city, fct_mon) %>%
    summarize(gte68=sum(gte68), lte32=sum(lte32), .groups="drop") %>%
    ggplot(aes(x=fct_mon, y=lte32)) + 
    geom_col(fill="lightblue") +
    geom_text(aes(label=lte32), size=2.5, vjust=0) +
    facet_wrap(~fct_city) + 
    labs(x=NULL, 
         y="# Days (2010-2022)", 
         title="Days with dewpoint <= 0C (32F)\n2010-2022"
         ) + 
    scale_color_discrete(NULL)

Proportion of days with dewpoints at or above 20C (68F) by month are explored:

dewPDaily %>%
    mutate(gte68=(dewpoint_2m_mean>=20), 
           lte32=(dewpoint_2m_mean<=0), 
           year=year(date), 
           doy=yday(date)
           ) %>%
    filter(year < 2023) %>%
    group_by(fct_city, fct_mon) %>%
    summarize(gte68=mean(gte68), lte32=mean(lte32), .groups="drop") %>%
    ggplot(aes(x=fct_mon, y=gte68)) + 
    geom_col(fill="lightblue") +
    geom_text(aes(label=round(gte68,2)), size=2.5, vjust=0) +
    facet_wrap(~fct_city) + 
    labs(x=NULL, 
         y="Proportion of Days (2010-2022)", 
         title="Days with dewpoint >= 20C (68F)\n2010-2022"
         ) + 
    scale_color_discrete(NULL)

Proportion of days with dewpoints at or below 0C (32F) by month are explored:

dewPDaily %>%
    mutate(gte68=(dewpoint_2m_mean>=20), 
           lte32=(dewpoint_2m_mean<=0), 
           year=year(date), 
           doy=yday(date)
           ) %>%
    filter(year < 2023) %>%
    group_by(fct_city, fct_mon) %>%
    summarize(gte68=mean(gte68), lte32=mean(lte32), .groups="drop") %>%
    ggplot(aes(x=fct_mon, y=lte32)) + 
    geom_col(fill="lightblue") +
    geom_text(aes(label=round(lte32,2)), size=2.5, vjust=0) +
    facet_wrap(~fct_city) + 
    labs(x=NULL, 
         y="Proportion of Days (2010-2022)", 
         title="Days with dewpoint <= 0C (32F)\n2010-2022"
         ) + 
    scale_color_discrete(NULL)

ACF is calculated for dew point for a single city:

dewPDaily %>% 
    filter(fct_city=="Chicago IL") %>% 
    arrange(date) %>% 
    select(dewpoint_2m_mean) %>% 
    acf(lag.max=1000, main="ACF for daily dewpoint - Chicago IL (2010-2022)")

ACF is calculated for dew point for each city:

dewCities <- dewPDaily %>% 
    pull(fct_city) %>% 
    as.vector() %>% 
    unique()

dewACF <- lapply(dewCities, 
                 FUN=function(x) {
                     dewPDaily %>% 
                         filter(fct_city==x) %>% 
                         arrange(date) %>% 
                         select(dewpoint_2m_mean) %>% 
                         acf(lag.max=2000, plot=FALSE)
                     }
                 )
names(dewACF) <- dewCities

ACF are combined into a single tibble:

dewACFAll <- map_dfr(.x=dewCities, 
                     .f=function(x) tibble::tibble(lag=as.vector(dewACF[[x]][["lag"]]), 
                                                   acf=as.vector(dewACF[[x]][["acf"]])
                                                   ), 
                     .id="src") %>% 
    mutate(cityName=dewCities[as.integer(src)])

dewACFAll
## # A tibble: 14,007 × 4
##    src     lag   acf cityName 
##    <chr> <dbl> <dbl> <chr>    
##  1 1         0 1     Boston MA
##  2 1         1 0.882 Boston MA
##  3 1         2 0.778 Boston MA
##  4 1         3 0.754 Boston MA
##  5 1         4 0.750 Boston MA
##  6 1         5 0.749 Boston MA
##  7 1         6 0.750 Boston MA
##  8 1         7 0.749 Boston MA
##  9 1         8 0.741 Boston MA
## 10 1         9 0.731 Boston MA
## # ℹ 13,997 more rows
dewACFAll %>%
    filter(lag > 0) %>%
    group_by(cityName) %>%
    summarize(mu_abs_acf=mean(abs(acf))) %>%
    arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Chicago IL          0.388
## 2 Boston MA           0.377
## 3 New York NY         0.372
## 4 Houston TX          0.280
## 5 Miami FL            0.250
## 6 Los Angeles CA      0.222
## 7 Las Vegas NV        0.138

ACF by city are plotted:

dewACFAll %>% 
    filter(lag>0) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="ACF", title="ACF of dewpoint by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

PACF is calculated for dew point for a single city:

dewPDaily %>% 
    filter(fct_city=="Chicago IL") %>% 
    arrange(date) %>% 
    select(dewpoint_2m_mean) %>% 
    pacf(lag.max=1000, main="PACF for daily dewpoint - Chicago IL (2010-2022)")

dewPDaily %>% 
    filter(fct_city=="Chicago IL") %>% 
    arrange(date) %>% 
    select(dewpoint_2m_mean) %>% 
    pacf(lag.max=20, main="PACF for daily dewpoint - Chicago IL (2010-2022)")

PACF is calculated for dew point for each city:

dewPACF <- lapply(dewCities, 
                  FUN=function(x) {
                      dewPDaily %>% 
                          filter(fct_city==x) %>% 
                          arrange(date) %>% 
                          select(dewpoint_2m_mean) %>% 
                          pacf(lag.max=2000, plot=FALSE)
                      }
                  )
names(dewPACF) <- dewCities

PACF are combined into a single tibble:

dewPACFAll <- map_dfr(.x=dewCities, 
                      .f=function(x) tibble::tibble(lag=as.vector(dewPACF[[x]][["lag"]]), 
                                                    pacf=as.vector(dewPACF[[x]][["acf"]])
                                                   ), 
                      .id="src") %>% 
    mutate(cityName=dewCities[as.integer(src)])
dewPACFAll
## # A tibble: 14,000 × 4
##    src     lag     pacf cityName 
##    <chr> <dbl>    <dbl> <chr>    
##  1 1         1  0.882   Boston MA
##  2 1         2 -0.00177 Boston MA
##  3 1         3  0.306   Boston MA
##  4 1         4  0.127   Boston MA
##  5 1         5  0.158   Boston MA
##  6 1         6  0.126   Boston MA
##  7 1         7  0.103   Boston MA
##  8 1         8  0.0677  Boston MA
##  9 1         9  0.0515  Boston MA
## 10 1        10  0.0707  Boston MA
## # ℹ 13,990 more rows
dewPACFAll %>%
    filter(lag > 0) %>%
    group_by(cityName) %>%
    summarize(mu_abs_pacf=mean(abs(pacf))) %>%
    arrange(desc(mu_abs_pacf))
## # A tibble: 7 × 2
##   cityName       mu_abs_pacf
##   <chr>                <dbl>
## 1 Boston MA           0.0114
## 2 New York NY         0.0113
## 3 Chicago IL          0.0112
## 4 Houston TX          0.0110
## 5 Los Angeles CA      0.0109
## 6 Miami FL            0.0108
## 7 Las Vegas NV        0.0104

PACF by city are plotted:

dewPACFAll %>% 
    filter(lag>0) %>%
    ggplot(aes(x=lag, y=pacf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="PACF", title="PACF of dewpoint by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1)) + 
    scale_x_log10()

A baseline ARIMA model is explored, starting with c(0, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily %>% 
                       filter(fct_city==x) %>% 
                       pull(dewpoint_2m_mean) %>% 
                       arima(order=c(0, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
               se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       intercept     se sigma2 sigma
##   <chr>              <dbl>  <dbl>  <dbl> <dbl>
## 1 Miami FL           19.8  0.0629   20.3  4.50
## 2 Los Angeles CA      6.85 0.0953   46.4  6.81
## 3 Las Vegas NV       -1.59 0.102    53.0  7.28
## 4 Houston TX         15.8  0.109    61.1  7.82
## 5 New York NY         6.56 0.144   102.  10.1 
## 6 Boston MA           5.47 0.143   104.  10.2 
## 7 Chicago IL          5.43 0.150   115.  10.7

The ARIMA model is explored at c(1, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(dewpoint_2m_mean) %>% 
                       arima(order=c(1, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ar1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Miami FL         4.85  2.20 0.872     19.8 
## 2 Los Angeles CA  10.9   3.30 0.875      6.85
## 3 Las Vegas NV    12.8   3.58 0.871     -1.59
## 4 Chicago IL      16.8   4.10 0.924      5.42
## 5 Houston TX      17.7   4.20 0.843     15.8 
## 6 New York NY     21.1   4.60 0.891      6.56
## 7 Boston MA       23.0   4.80 0.882      5.47

The ARIMA model is explored at c(0, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(dewpoint_2m_mean) %>% 
                       arima(order=c(0, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ma1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Miami FL         8.38  2.90 0.796     19.8 
## 2 Los Angeles CA  19.2   4.38 0.794      6.85
## 3 Las Vegas NV    22.5   4.75 0.776     -1.59
## 4 Houston TX      26.1   5.11 0.786     15.8 
## 5 New York NY     42.9   6.55 0.782      6.56
## 6 Chicago IL      43.6   6.60 0.821      5.42
## 7 Boston MA       44.4   6.66 0.776      5.47

The ARIMA model is explored at c(1, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(dewpoint_2m_mean) %>% 
                       arima(order=c(1, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma   ar1     ma1 intercept
##   <chr>           <dbl> <dbl> <dbl>   <dbl>     <dbl>
## 1 Miami FL         4.58  2.14 0.799 0.315       19.8 
## 2 Los Angeles CA  10.3   3.21 0.804 0.307        6.85
## 3 Las Vegas NV    12.2   3.50 0.807 0.269       -1.59
## 4 Houston TX      16.3   4.04 0.733 0.394       15.8 
## 5 Chicago IL      16.6   4.08 0.902 0.151        5.43
## 6 New York NY     21.1   4.60 0.878 0.0594       6.56
## 7 Boston MA       23.0   4.80 0.881 0.00734      5.47

The ARIMA model is explored at c(2, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(dewpoint_2m_mean) %>% 
                       arima(order=c(2, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma   ar1      ar2 intercept
##   <chr>           <dbl> <dbl> <dbl>    <dbl>     <dbl>
## 1 Miami FL         4.68  2.16 1.04  -0.187       19.8 
## 2 Los Angeles CA  10.5   3.24 1.03  -0.181        6.85
## 3 Las Vegas NV    12.4   3.52 1.03  -0.183       -1.59
## 4 Chicago IL      16.7   4.09 0.989 -0.0703       5.43
## 5 Houston TX      17.0   4.12 1.00  -0.191       15.8 
## 6 New York NY     21.1   4.60 0.908 -0.0194       6.56
## 7 Boston MA       23.0   4.80 0.884 -0.00206      5.47

The auto.arima() process is run to estimate optimal parameters for a single city:

dewPDaily %>% 
    filter(fct_city=="Chicago IL") %>% 
    arrange(date) %>%
    pull(dewpoint_2m_mean) %>% 
    forecast::auto.arima()
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Series: . 
## ARIMA(5,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     ar5    mean
##       0.9676  -0.3158  0.1975  -0.0276  0.1317  5.4251
## s.e.  0.0139   0.0194  0.0197   0.0194  0.0139  1.1627
## 
## sigma^2 = 15.17:  log likelihood = -14205.69
## AIC=28425.37   AICc=28425.39   BIC=28471.15

The auto.arima() process is run to estimate optimal parameters for all cities:

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(dewpoint_2m_mean) %>% 
                       forecast::auto.arima()
                   )
names(tstARIMA) <- dewCities

The optimal model by city is pulled:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t()
##                ar ma d  sig2
## Boston MA       5  0 0 20.01
## Chicago IL      5  0 0 15.17
## Houston TX      5  0 0 15.04
## Las Vegas NV    2  3 0 11.83
## Los Angeles CA  2  2 0  9.71
## Miami FL        2  4 1  4.26
## New York NY     5  0 0 18.56

Sigma-squared is also explored:

cat("\n\nOptimal Parameters (ARIMA for daily mean dewpoint by city):\n")
## 
## 
## Optimal Parameters (ARIMA for daily mean dewpoint by city):
sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("cityName") %>% 
    full_join(dewPDaily %>% group_by(fct_city) %>% summarize(varOrig=var(dewpoint_2m_mean)), 
              by=c("cityName"="fct_city")
              ) %>% 
    mutate(pct=1-sig2/varOrig)
##         cityName ar ma d  sig2   varOrig       pct
## 1      Boston MA  5  0 0 20.01 103.86622 0.8073483
## 2     Chicago IL  5  0 0 15.17 115.11075 0.8682139
## 3     Houston TX  5  0 0 15.04  61.13661 0.7539935
## 4   Las Vegas NV  2  3 0 11.83  53.01145 0.7768406
## 5 Los Angeles CA  2  2 0  9.71  46.41365 0.7907943
## 6       Miami FL  2  4 1  4.26  20.26281 0.7897626
## 7    New York NY  5  0 0 18.56 102.44193 0.8188242

Rolling 21-day dewpoints by city are plotted:

dewPDaily %>% 
    mutate(doy=yday(date)) %>% 
    arrange(date) %>% 
    group_by(fct_city) %>% 
    helperRollingAgg(origVar="dewpoint_2m_mean", newName="dewpoint_r21", k=21) %>% 
    group_by(fct_city, doy) %>% 
    summarize(dewpoint_r21=mean(dewpoint_r21, na.rm=TRUE), .groups="drop") %>% 
    ggplot(aes(x=doy)) + 
    geom_line(aes(y=dewpoint_r21), lwd=2) + 
    geom_point(data=dewPDaily %>% mutate(doy=yday(date), year=year(date)) %>% arrange(date), 
               aes(y=dewpoint_2m_mean, group=year), 
               alpha=0.25, 
               size=0.1
               ) + 
    facet_wrap(~fct_city) + 
    labs(x="Day of year", 
         y="Dewpoint (C)", 
         title="Dewpoint by day of year\n(actual and rolling 21-day average)"
         )

Daily dewpoint data are converted to deviation from long-term rolling 21-day mean by day of year:

dewPDaily_vs_r21 <- dewPDaily %>% 
    mutate(doy=pmin(yday(date), 365), year=year(date)) %>% 
    left_join(dewPDaily %>% 
                  mutate(doy=pmin(yday(date),365)) %>% 
                  arrange(date) %>%
                  group_by(fct_city) %>% 
                  helperRollingAgg(origVar="dewpoint_2m_mean", newName="dewpoint_r21", k=21) %>% 
                  group_by(fct_city, doy) %>% 
                  summarize(dewpoint_r21=mean(dewpoint_r21, na.rm=TRUE), .groups="drop"), 
              by=c("fct_city", "doy")
              ) %>% 
    mutate(delta_r21=dewpoint_2m_mean-dewpoint_r21)

dewPDaily_vs_r21
## # A tibble: 35,592 × 10
##    fct_city  fct_mon date       dewpoint_2m_mean dewpoint_2m_min dewpoint_2m_max
##    <fct>     <fct>   <date>                <dbl>           <dbl>           <dbl>
##  1 Boston MA Jan     2010-01-01            -3.48            -5.1            -1.5
##  2 Boston MA Jan     2010-01-02            -4.96           -10              -0.6
##  3 Boston MA Jan     2010-01-03            -9.23           -12.8            -4.7
##  4 Boston MA Jan     2010-01-04            -6.86            -8.6            -4.7
##  5 Boston MA Jan     2010-01-05            -8.16            -9.4            -6.5
##  6 Boston MA Jan     2010-01-06            -8.53            -9.3            -7.4
##  7 Boston MA Jan     2010-01-07            -7.86            -9.2            -5.5
##  8 Boston MA Jan     2010-01-08            -8.65           -10.9            -7.5
##  9 Boston MA Jan     2010-01-09           -13.5            -17             -11.1
## 10 Boston MA Jan     2010-01-10           -16.4            -18             -14.4
## # ℹ 35,582 more rows
## # ℹ 4 more variables: doy <dbl>, year <dbl>, dewpoint_r21 <dbl>,
## #   delta_r21 <dbl>
summary(dewPDaily_vs_r21)
##            fct_city       fct_mon           date            dewpoint_2m_mean   
##  Boston MA     :5113   Jan    : 3038   Min.   :2010-01-01   Min.   :-35.76667  
##  Chicago IL    :5113   Mar    : 3038   1st Qu.:2013-06-25   1st Qu.: -0.01771  
##  Houston TX    :5113   May    : 3038   Median :2016-12-17   Median :  9.47500  
##  Las Vegas NV  :5113   Jul    : 3007   Mean   :2016-12-17   Mean   :  8.34512  
##  Los Angeles CA:5113   Aug    : 3007   3rd Qu.:2020-06-10   3rd Qu.: 17.53333  
##  Miami FL      :5113   Oct    : 3007   Max.   :2023-12-31   Max.   : 25.94167  
##  New York NY   :4914   (Other):17457                                           
##  dewpoint_2m_min   dewpoint_2m_max       doy             year     
##  Min.   :-38.600   Min.   :-32.00   Min.   :  1.0   Min.   :2010  
##  1st Qu.: -3.500   1st Qu.:  3.80   1st Qu.: 91.0   1st Qu.:2013  
##  Median :  6.300   Median : 12.50   Median :182.0   Median :2016  
##  Mean   :  5.499   Mean   : 11.28   Mean   :182.6   Mean   :2016  
##  3rd Qu.: 15.200   3rd Qu.: 19.90   3rd Qu.:274.0   3rd Qu.:2020  
##  Max.   : 25.200   Max.   : 27.60   Max.   :365.0   Max.   :2023  
##                                                                   
##   dewpoint_r21        delta_r21        
##  Min.   :-8.98471   Min.   :-28.33807  
##  1st Qu.:-0.04931   1st Qu.: -2.77771  
##  Median : 8.78258   Median :  0.33513  
##  Mean   : 8.34959   Mean   : -0.00446  
##  3rd Qu.:16.58411   3rd Qu.:  3.06098  
##  Max.   :23.96929   Max.   : 19.74022  
## 

A baseline ARIMA model is explored, starting with c(0, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
               se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       intercept     se sigma2 sigma
##   <chr>              <dbl>  <dbl>  <dbl> <dbl>
## 1 Miami FL       -0.0220   0.0441   9.95  3.15
## 2 Los Angeles CA  0.0118   0.0701  25.1   5.01
## 3 Houston TX     -0.0141   0.0729  27.2   5.21
## 4 Boston MA       0.00479  0.0733  27.5   5.24
## 5 New York NY    -0.0177   0.0750  27.6   5.26
## 6 Chicago IL      0.00493  0.0740  28.0   5.29
## 7 Las Vegas NV    0.000614 0.0816  34.1   5.84

The ARIMA model is explored at c(1, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ar1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Miami FL         4.52  2.13 0.739 -0.0221  
## 2 Los Angeles CA  10.3   3.21 0.768  0.0118  
## 3 Las Vegas NV    12.3   3.51 0.799  0.000632
## 4 Chicago IL      14.7   3.84 0.688  0.00495 
## 5 Houston TX      15.8   3.97 0.648 -0.0141  
## 6 New York NY     17.8   4.22 0.596 -0.0177  
## 7 Boston MA       19.0   4.36 0.556  0.00478

The ARIMA model is explored at c(0, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ma1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Miami FL         5.08  2.25 0.724 -0.0219  
## 2 Los Angeles CA  12.3   3.51 0.734  0.0117  
## 3 Houston TX      15.1   3.88 0.706 -0.0141  
## 4 Chicago IL      15.8   3.98 0.678  0.00488 
## 5 Las Vegas NV    16.4   4.05 0.731  0.000566
## 6 New York NY     17.7   4.21 0.622 -0.0176  
## 7 Boston MA       18.4   4.29 0.606  0.00477

The ARIMA model is explored at c(1, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma   ar1   ma1 intercept
##   <chr>           <dbl> <dbl> <dbl> <dbl>     <dbl>
## 1 Miami FL         4.11  2.03 0.572 0.406 -0.0222  
## 2 Los Angeles CA   9.45  3.07 0.622 0.385  0.0118  
## 3 Las Vegas NV    11.6   3.40 0.691 0.318  0.000754
## 4 Houston TX      13.7   3.70 0.397 0.512 -0.0141  
## 5 Chicago IL      13.8   3.72 0.503 0.371  0.00503 
## 6 New York NY     16.8   4.10 0.353 0.397 -0.0178  
## 7 Boston MA       17.9   4.23 0.276 0.429  0.00474

The ARIMA model is explored at c(2, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(2, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma   ar1    ar2 intercept
##   <chr>           <dbl> <dbl> <dbl>  <dbl>     <dbl>
## 1 Miami FL         4.17  2.04 0.945 -0.279 -0.0219  
## 2 Los Angeles CA   9.62  3.10 0.963 -0.253  0.0118  
## 3 Las Vegas NV    11.7   3.41 0.982 -0.229  0.000716
## 4 Chicago IL      14.0   3.75 0.839 -0.219  0.00523 
## 5 Houston TX      14.0   3.75 0.863 -0.332 -0.0139  
## 6 New York NY     17.1   4.13 0.719 -0.207 -0.0176  
## 7 Boston MA       18.1   4.26 0.673 -0.211  0.00475

Sigma-squared for a single city is compared across models:

pdqList <- list("c(0, 0, 0)"=c(0, 0, 0), 
                "c(1, 0, 0)"=c(1, 0, 0), 
                "c(0, 0, 1)"=c(0, 0, 1), 
                "c(1, 0, 1)"=c(1, 0, 1), 
                "c(2, 0, 0)"=c(2, 0, 0)
                )

sapply(pdqList, FUN=function(pdq) lapply("Boston MA", FUN=function(x) { tmp<- dewPDaily_vs_r21 %>% 
                                             arrange(date) %>%
                                             filter(fct_city==x) %>% 
                                             pull(delta_r21) %>% 
                                             arima(x=., order=unname(pdq)); tmp$sigma2}
                                         )
       ) %>% 
    as_tibble() %>% 
    t()
##                [,1]
## c(0, 0, 0) 27.48617
## c(1, 0, 0) 18.98769
## c(0, 0, 1) 18.41728
## c(1, 0, 1) 17.87572
## c(2, 0, 0) 18.14050

Sigma-squared for all cities is compiled across models:

pdqList <- list("c(0, 0, 0)"=c(0, 0, 0), 
                "c(1, 0, 0)"=c(1, 0, 0), 
                "c(0, 0, 1)"=c(0, 0, 1), 
                "c(1, 0, 1)"=c(1, 0, 1), 
                "c(2, 0, 0)"=c(2, 0, 0)
                )

pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities, 
                                                   FUN=function(x) { 
                                                       tmp<- dewPDaily_vs_r21 %>% 
                                                           arrange(date) %>%
                                                           filter(fct_city==x) %>% 
                                                           pull(delta_r21) %>% 
                                                           arima(x=., order=unname(pdq)) 
                                                       tmp$sigma2
                                                       }
                                                   )
                  ) %>%
    t() %>%
    as.data.frame() %>%
    rownames_to_column("ARIMA") %>%
    as_tibble()

pdqAll
## # A tibble: 5 × 8
##   ARIMA    `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
##   <chr>          <dbl>        <dbl>        <dbl>          <dbl>            <dbl>
## 1 c(0, 0,…        27.5         28.0         27.2           34.0            25.1 
## 2 c(1, 0,…        19.0         14.7         15.8           12.3            10.3 
## 3 c(0, 0,…        18.4         15.8         15.1           16.4            12.3 
## 4 c(1, 0,…        17.9         13.8         13.7           11.6             9.45
## 5 c(2, 0,…        18.1         14.0         14.0           11.7             9.62
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>

Sigma-squared for all cities is plotted:

pdqAll %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily dewpoint vs. long-term 21-day mean for day of year)"
         )

The auto.arima() process is run to estimate optimal parameters for all cities:

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) dewPDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       forecast::auto.arima(d=0)
                   )
names(tstARIMA) <- dewCities

The auto.arima() parameters are pulled for all cities:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t()
##                ar ma d  sig2
## Boston MA       1  3 0 17.82
## Chicago IL      3  2 0 13.79
## Houston TX      1  3 0 13.65
## Las Vegas NV    1  1 0 11.55
## Los Angeles CA  2  2 0  9.42
## Miami FL        5  5 0  4.10
## New York NY     2  3 0 16.72

Sigma-squared for all cities is updated:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily dewpoint vs. long-term 21-day mean for day of year)"
         )

Sigma-squared for all cities is updated, incuding the baseline:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    bind_rows(dewPDaily %>% 
                  group_by(fct_city) %>% 
                  summarize(value=var(dewpoint_2m_mean)) %>% 
                  mutate(ARIMA="base") %>% 
                  pivot_wider(id_cols="ARIMA", names_from="fct_city")
              ) %>%
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily dewpoint vs. long-term 21-day mean for day of year)",
         caption="base=daily dewpoint (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
    )

Lag/lead for dewpoint vs. 21-day mean is plotted:

dewPDaily_vs_r21 %>% 
    arrange(fct_city, date) %>% 
    group_by(fct_city) %>% 
    mutate(dewNext=lead(delta_r21)) %>% 
    ungroup() %>% 
    filter(complete.cases(.)) %>% 
    count(fct_city, delta_r21=round(delta_r21), dewNext=round(dewNext)) %>% 
    ggplot(aes(x=delta_r21, y=dewNext)) + 
    geom_point(aes(size=n), alpha=0.1) + 
    facet_wrap(~fct_city) + 
    geom_smooth(aes(weight=n), method="lm") + 
    annotate("point", x=0, y=0, color="red") + 
    geom_abline(intercept=0, slope=1, color="red", lty=2) + 
    labs(x="Dewpoint vs. long-term 21-day mean", 
         y="Next day's dewpoint vs. long-term 21-day mean", 
         title="Dewpoint vs. long-term 21-day mean by day of year and city"
         ) + 
    scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'

Daily mean temperature data are converted to deviation from long-term rolling 21-day mean by day of year:

tempDaily_vs_r21 <- allCityDaily %>% 
    mutate(doy=pmin(yday(date), 365), 
           year=year(date), 
           meanTemp=0.5*(temperature_2m_max+temperature_2m_min), 
           fct_city=factor(cityName)
           ) %>% 
    select(fct_city, date, doy, year, meanTemp) %>%
    left_join(allCityDaily %>% 
                  mutate(doy=pmin(yday(date),365), 
                         meanTemp=0.5*(temperature_2m_max+temperature_2m_min),
                         fct_city=factor(cityName)
                         ) %>% 
                  arrange(date) %>%
                  group_by(fct_city) %>% 
                  helperRollingAgg(origVar="meanTemp", newName="temp_r21", k=21) %>% 
                  group_by(fct_city, doy) %>% 
                  summarize(temp_r21=mean(temp_r21, na.rm=TRUE), .groups="drop"), 
              by=c("fct_city", "doy")
              ) %>% 
    mutate(delta_r21=meanTemp-temp_r21)

tempDaily_vs_r21
## # A tibble: 53,855 × 7
##    fct_city  date         doy  year meanTemp temp_r21 delta_r21
##    <fct>     <date>     <dbl> <dbl>    <dbl>    <dbl>     <dbl>
##  1 Boston MA 2010-01-01     1  2010     0.5   0.112       0.388
##  2 Boston MA 2010-01-02     2  2010    -2.55  0.0786     -2.63 
##  3 Boston MA 2010-01-03     3  2010    -5.15  0.00934    -5.16 
##  4 Boston MA 2010-01-04     4  2010    -2.5  -0.152      -2.35 
##  5 Boston MA 2010-01-05     5  2010    -3.15 -0.338      -2.81 
##  6 Boston MA 2010-01-06     6  2010    -3    -0.396      -2.60 
##  7 Boston MA 2010-01-07     7  2010    -1.65 -0.486      -1.16 
##  8 Boston MA 2010-01-08     8  2010    -4.6  -0.552      -4.05 
##  9 Boston MA 2010-01-09     9  2010    -8.5  -0.567      -7.93 
## 10 Boston MA 2010-01-10    10  2010    -9.05 -0.608      -8.44 
## # ℹ 53,845 more rows
summary(tempDaily_vs_r21)
##            fct_city          date                 doy             year     
##  Boston MA     : 5113   Min.   :1960-01-01   Min.   :  1.0   Min.   :1960  
##  Chicago IL    :23376   1st Qu.:1996-11-10   1st Qu.: 91.0   1st Qu.:1996  
##  Houston TX    : 5113   Median :2013-05-22   Median :183.0   Median :2013  
##  Las Vegas NV  : 5113   Mean   :2006-02-14   Mean   :182.8   Mean   :2006  
##  Los Angeles CA: 5113   3rd Qu.:2018-08-28   3rd Qu.:274.0   3rd Qu.:2018  
##  Miami FL      : 5113   Max.   :2023-12-31   Max.   :365.0   Max.   :2023  
##  New York NY   : 4914                                                      
##     meanTemp         temp_r21        delta_r21         
##  Min.   :-29.00   Min.   :-5.275   Min.   :-2.388e+01  
##  1st Qu.:  6.75   1st Qu.: 6.956   1st Qu.:-2.371e+00  
##  Median : 16.40   Median :16.192   Median : 2.568e-02  
##  Mean   : 14.53   Mean   :14.532   Mean   :-8.365e-04  
##  3rd Qu.: 23.30   3rd Qu.:22.876   3rd Qu.: 2.402e+00  
##  Max.   : 39.75   Max.   :32.763   Max.   : 1.863e+01  
## 

A baseline ARIMA model is explored, starting with c(0, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) tempDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
               se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName         intercept     se sigma2 sigma
##   <chr>                <dbl>  <dbl>  <dbl> <dbl>
## 1 Miami FL       -0.0182     0.0293   4.40  2.10
## 2 Los Angeles CA  0.00593    0.0453  10.5   3.24
## 3 Las Vegas NV    0.00745    0.0473  11.4   3.38
## 4 Houston TX     -0.0135     0.0517  13.7   3.69
## 5 New York NY    -0.00916    0.0540  14.3   3.79
## 6 Boston MA       0.00000680 0.0556  15.8   3.97
## 7 Chicago IL      0.00400    0.0303  21.5   4.64

Lag/lead for temperature vs. 21-day mean is plotted:

tempDaily_vs_r21 %>% 
    arrange(fct_city, date) %>% 
    group_by(fct_city) %>% 
    mutate(tempNext=lead(delta_r21)) %>% 
    ungroup() %>% 
    filter(complete.cases(.)) %>% 
    count(fct_city, delta_r21=round(delta_r21), tempNext=round(tempNext)) %>% 
    ggplot(aes(x=delta_r21, y=tempNext)) + 
    geom_point(aes(size=n), alpha=0.1) + 
    facet_wrap(~fct_city) + 
    geom_smooth(aes(weight=n), method="lm") + 
    annotate("point", x=0, y=0, color="red") + 
    geom_abline(intercept=0, slope=1, color="red", lty=2) + 
    labs(x="Temperature vs. long-term 21-day mean", 
         y="Next day's temperature vs. long-term 21-day mean", 
         title="Temperature vs. long-term 21-day mean by day of year and city"
         ) + 
    scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'

The ARIMA model is explored at c(1, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) tempDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ar1   intercept
##   <chr>           <dbl> <dbl> <dbl>       <dbl>
## 1 Miami FL         2.12  1.46 0.721 -0.0183    
## 2 Los Angeles CA   3.31  1.82 0.827  0.00592   
## 3 Las Vegas NV     4.31  2.08 0.789  0.00746   
## 4 Houston TX       6.44  2.54 0.727 -0.0134    
## 5 New York NY      7.88  2.81 0.671 -0.00917   
## 6 Boston MA        9.17  3.03 0.648  0.00000623
## 7 Chicago IL       9.74  3.12 0.740  0.00400

The ARIMA model is explored at c(0, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) tempDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ma1  intercept
##   <chr>           <dbl> <dbl> <dbl>      <dbl>
## 1 Miami FL         2.38  1.54 0.696 -0.0181   
## 2 Los Angeles CA   4.70  2.17 0.758  0.00586  
## 3 Las Vegas NV     5.66  2.38 0.718  0.00737  
## 4 Houston TX       6.95  2.64 0.725 -0.0135   
## 5 New York NY      8.61  2.93 0.636 -0.00914  
## 6 Boston MA        9.58  3.10 0.638  0.0000114
## 7 Chicago IL      11.3   3.36 0.708  0.00400

The ARIMA model is explored at c(1, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) tempDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma   ar1   ma1  intercept
##   <chr>           <dbl> <dbl> <dbl> <dbl>      <dbl>
## 1 Miami FL         1.99  1.41 0.559 0.356 -0.0183   
## 2 Los Angeles CA   2.96  1.72 0.728 0.366  0.00589  
## 3 Las Vegas NV     4.06  2.01 0.685 0.298  0.00754  
## 4 Houston TX       5.70  2.39 0.542 0.451 -0.0134   
## 5 New York NY      7.61  2.76 0.511 0.297 -0.00923  
## 6 Boston MA        8.70  2.95 0.449 0.358 -0.0000145
## 7 Chicago IL       9.21  3.03 0.589 0.346  0.00401

The ARIMA model is explored at c(2, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) tempDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(2, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma   ar1    ar2  intercept
##   <chr>           <dbl> <dbl> <dbl>  <dbl>      <dbl>
## 1 Miami FL         2.02  1.42 0.877 -0.217 -0.0181   
## 2 Los Angeles CA   2.95  1.72 1.10  -0.330  0.00578  
## 3 Las Vegas NV     4.06  2.02 0.978 -0.239  0.00747  
## 4 Houston TX       5.82  2.41 0.952 -0.310 -0.0133   
## 5 New York NY      7.68  2.77 0.779 -0.162 -0.00904  
## 6 Boston MA        8.82  2.97 0.774 -0.195  0.0000453
## 7 Chicago IL       9.35  3.06 0.888 -0.200  0.00400

Sigma-squared for all cities is compiled across models:

pdqList <- list("c(0, 0, 0)"=c(0, 0, 0), 
                "c(1, 0, 0)"=c(1, 0, 0), 
                "c(0, 0, 1)"=c(0, 0, 1), 
                "c(1, 0, 1)"=c(1, 0, 1), 
                "c(2, 0, 0)"=c(2, 0, 0)
                )

pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities, 
                                                   FUN=function(x) { 
                                                       tmp<- tempDaily_vs_r21 %>% 
                                                           arrange(date) %>%
                                                           filter(fct_city==x) %>% 
                                                           pull(delta_r21) %>% 
                                                           arima(x=., order=unname(pdq)) 
                                                       tmp$sigma2
                                                       }
                                                   )
                  ) %>%
    t() %>%
    as.data.frame() %>%
    rownames_to_column("ARIMA") %>%
    as_tibble()

pdqAll
## # A tibble: 5 × 8
##   ARIMA    `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
##   <chr>          <dbl>        <dbl>        <dbl>          <dbl>            <dbl>
## 1 c(0, 0,…       15.8         21.5         13.6           11.4             10.5 
## 2 c(1, 0,…        9.16         9.74         6.44           4.31             3.31
## 3 c(0, 0,…        9.58        11.3          6.95           5.66             4.70
## 4 c(1, 0,…        8.70         9.21         5.69           4.06             2.96
## 5 c(2, 0,…        8.82         9.35         5.82           4.06             2.95
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>

Sigma-squared for all cities is plotted:

pdqAll %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily temperature vs. long-term 21-day mean for day of year)"
         )

The auto.arima() process is run to estimate optimal parameters for all cities:

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) tempDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       forecast::auto.arima(d=0)
                   )
names(tstARIMA) <- dewCities

The auto.arima() parameters are pulled for all cities:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t()
##                ar ma d sig2
## Boston MA       3  3 0 8.63
## Chicago IL      2  4 0 9.12
## Houston TX      4  1 0 5.63
## Las Vegas NV    1  4 0 4.05
## Los Angeles CA  1  5 0 2.94
## Miami FL        2  3 0 1.97
## New York NY     2  3 0 7.51

Sigma-squared for all cities is updated:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily temperature vs. long-term 21-day mean for day of year)"
         )

Sigma-squared for all cities is updated, incuding the baseline:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    bind_rows(tempDaily_vs_r21 %>% 
                  group_by(fct_city) %>% 
                  summarize(value=var(meanTemp)) %>% 
                  mutate(ARIMA="base") %>% 
                  pivot_wider(id_cols="ARIMA", names_from="fct_city")
              ) %>%
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily temperature vs. long-term 21-day mean for day of year)",
         caption="base=daily temperature (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
    )

Daily maximum wind speed data are converted to deviation from long-term rolling 21-day mean by day of year:

windDaily_vs_r21 <- allCityDaily %>% 
    mutate(doy=pmin(yday(date), 365), 
           year=year(date), 
           maxWind=windspeed_10m_max, 
           fct_city=factor(cityName)
           ) %>% 
    select(fct_city, date, doy, year, maxWind) %>%
    left_join(allCityDaily %>% 
                  mutate(doy=pmin(yday(date),365), 
                         maxWind=windspeed_10m_max, 
                         fct_city=factor(cityName)
                         ) %>% 
                  arrange(date) %>%
                  group_by(fct_city) %>% 
                  helperRollingAgg(origVar="maxWind", newName="wind_r21", k=21) %>% 
                  group_by(fct_city, doy) %>% 
                  summarize(wind_r21=mean(wind_r21, na.rm=TRUE), .groups="drop"), 
              by=c("fct_city", "doy")
              ) %>% 
    mutate(delta_r21=maxWind-wind_r21)

windDaily_vs_r21
## # A tibble: 53,855 × 7
##    fct_city  date         doy  year maxWind wind_r21 delta_r21
##    <fct>     <date>     <dbl> <dbl>   <dbl>    <dbl>     <dbl>
##  1 Boston MA 2010-01-01     1  2010    13.1     22.5    -9.35 
##  2 Boston MA 2010-01-02     2  2010    34.7     22.6    12.1  
##  3 Boston MA 2010-01-03     3  2010    34.8     22.7    12.1  
##  4 Boston MA 2010-01-04     4  2010    28.1     22.6     5.50 
##  5 Boston MA 2010-01-05     5  2010    17.1     22.5    -5.40 
##  6 Boston MA 2010-01-06     6  2010    20.7     22.4    -1.69 
##  7 Boston MA 2010-01-07     7  2010    21.1     22.3    -1.18 
##  8 Boston MA 2010-01-08     8  2010    21.5     22.2    -0.738
##  9 Boston MA 2010-01-09     9  2010    22       22.3    -0.274
## 10 Boston MA 2010-01-10    10  2010    20       22.4    -2.41 
## # ℹ 53,845 more rows
summary(windDaily_vs_r21)
##            fct_city          date                 doy             year     
##  Boston MA     : 5113   Min.   :1960-01-01   Min.   :  1.0   Min.   :1960  
##  Chicago IL    :23376   1st Qu.:1996-11-10   1st Qu.: 91.0   1st Qu.:1996  
##  Houston TX    : 5113   Median :2013-05-22   Median :183.0   Median :2013  
##  Las Vegas NV  : 5113   Mean   :2006-02-14   Mean   :182.8   Mean   :2006  
##  Los Angeles CA: 5113   3rd Qu.:2018-08-28   3rd Qu.:274.0   3rd Qu.:2018  
##  Miami FL      : 5113   Max.   :2023-12-31   Max.   :365.0   Max.   :2023  
##  New York NY   : 4914                                                      
##     maxWind         wind_r21       delta_r21         
##  Min.   : 4.30   Min.   :11.69   Min.   :-18.591295  
##  1st Qu.:14.30   1st Qu.:17.66   1st Qu.: -4.386224  
##  Median :18.90   Median :20.46   Median : -0.685034  
##  Mean   :19.91   Mean   :19.91   Mean   : -0.003014  
##  3rd Qu.:24.50   3rd Qu.:23.23   3rd Qu.:  3.607297  
##  Max.   :97.00   Max.   :25.18   Max.   : 79.880612  
## 

Lag/lead for maximum wind speed vs. 21-day mean is plotted:

windDaily_vs_r21 %>% 
    arrange(fct_city, date) %>% 
    group_by(fct_city) %>% 
    mutate(windNext=lead(delta_r21)) %>% 
    ungroup() %>% 
    filter(complete.cases(.)) %>% 
    count(fct_city, delta_r21=round(delta_r21), windNext=round(windNext)) %>% 
    ggplot(aes(x=delta_r21, y=windNext)) + 
    geom_point(aes(size=n), alpha=0.1) + 
    facet_wrap(~fct_city) + 
    geom_smooth(aes(weight=n), method="lm") + 
    annotate("point", x=0, y=0, color="red") + 
    geom_abline(intercept=0, slope=1, color="red", lty=2) + 
    labs(x="Maximum wind speed vs. long-term 21-day mean", 
         y="Next day's maximum wind speed vs. long-term 21-day mean", 
         title="Maximum wind speed vs. long-term 21-day mean by day of year and city"
         ) + 
    scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'

A baseline ARIMA model is explored, starting with c(0, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) windDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
               se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       intercept     se sigma2 sigma
##   <chr>              <dbl>  <dbl>  <dbl> <dbl>
## 1 Los Angeles CA -0.0117   0.0555   15.8  3.97
## 2 New York NY     0.00404  0.0812   32.4  5.69
## 3 Miami FL        0.00896  0.0827   34.9  5.91
## 4 Houston TX     -0.00981  0.0827   35.0  5.91
## 5 Boston MA      -0.0122   0.0879   39.5  6.29
## 6 Chicago IL      0.000733 0.0454   48.2  6.95
## 7 Las Vegas NV   -0.0141   0.111    62.7  7.92

The ARIMA model is explored at c(1, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) windDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ar1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Los Angeles CA   12.3  3.51 0.468 -0.0117  
## 2 Miami FL         24.9  4.99 0.536  0.00885 
## 3 New York NY      28.7  5.36 0.336  0.00409 
## 4 Houston TX       28.9  5.38 0.416 -0.00984 
## 5 Boston MA        35.7  5.98 0.311 -0.0122  
## 6 Chicago IL       41.7  6.46 0.368  0.000734
## 7 Las Vegas NV     48.6  6.97 0.474 -0.0141

The ARIMA model is explored at c(0, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) windDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ma1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Los Angeles CA   12.8  3.57 0.419 -0.0118  
## 2 Miami FL         25.9  5.09 0.499  0.00896 
## 3 New York NY      28.2  5.31 0.382  0.00398 
## 4 Houston TX       29.0  5.38 0.415 -0.00976 
## 5 Boston MA        35.3  5.94 0.345 -0.0124  
## 6 Chicago IL       41.4  6.44 0.383  0.000734
## 7 Las Vegas NV     50.4  7.10 0.426 -0.0142

The ARIMA model is explored at c(1, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) windDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma    ar1     ma1 intercept
##   <chr>           <dbl> <dbl>  <dbl>   <dbl>     <dbl>
## 1 Los Angeles CA   12.3  3.51 0.476  -0.0104 -0.0116  
## 2 Miami FL         24.7  4.97 0.418   0.166   0.00853 
## 3 New York NY      28.2  5.31 0.0118  0.372   0.00451 
## 4 Houston TX       28.7  5.35 0.234   0.222  -0.00996 
## 5 Boston MA        35.3  5.94 0.0187  0.328  -0.0122  
## 6 Chicago IL       41.3  6.43 0.144   0.261   0.000744
## 7 Las Vegas NV     48.6  6.97 0.463   0.0145 -0.0140

The ARIMA model is explored at c(2, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) windDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(2, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma   ar1      ar2 intercept
##   <chr>           <dbl> <dbl> <dbl>    <dbl>     <dbl>
## 1 Los Angeles CA   12.3  3.51 0.466  0.00332 -0.0117  
## 2 Miami FL         24.7  4.97 0.583 -0.0893   0.00861 
## 3 New York NY      28.3  5.32 0.378 -0.124    0.00395 
## 4 Houston TX       28.7  5.36 0.452 -0.0869  -0.00971 
## 5 Boston MA        35.3  5.94 0.344 -0.108   -0.0124  
## 6 Chicago IL       41.4  6.43 0.401 -0.0896   0.000733
## 7 Las Vegas NV     48.6  6.97 0.476 -0.00481 -0.0140

Sigma-squared for all cities is compiled across models:

pdqList <- list("c(0, 0, 0)"=c(0, 0, 0), 
                "c(1, 0, 0)"=c(1, 0, 0), 
                "c(0, 0, 1)"=c(0, 0, 1), 
                "c(1, 0, 1)"=c(1, 0, 1), 
                "c(2, 0, 0)"=c(2, 0, 0)
                )

pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities, 
                                                   FUN=function(x) { 
                                                       tmp<- windDaily_vs_r21 %>% 
                                                           arrange(date) %>%
                                                           filter(fct_city==x) %>% 
                                                           pull(delta_r21) %>% 
                                                           arima(x=., order=unname(pdq)) 
                                                       tmp$sigma2
                                                       }
                                                   )
                  ) %>%
    t() %>%
    as.data.frame() %>%
    rownames_to_column("ARIMA") %>%
    as_tibble()

pdqAll
## # A tibble: 5 × 8
##   ARIMA    `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
##   <chr>          <dbl>        <dbl>        <dbl>          <dbl>            <dbl>
## 1 c(0, 0,…        39.5         48.2         35.0           62.6             15.7
## 2 c(1, 0,…        35.7         41.7         28.9           48.6             12.3
## 3 c(0, 0,…        35.3         41.4         29.0           50.4             12.8
## 4 c(1, 0,…        35.3         41.3         28.7           48.6             12.3
## 5 c(2, 0,…        35.3         41.4         28.7           48.6             12.3
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>

Sigma-squared for all cities is plotted:

pdqAll %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily maximum wind speed vs. long-term 21-day mean for day of year)"
         )

The auto.arima() process is run to estimate optimal parameters for all cities:

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) windDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       forecast::auto.arima(d=0)
                   )
names(tstARIMA) <- dewCities

The auto.arima() parameters are pulled for all cities:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t()
##                ar ma d  sig2
## Boston MA       2  2 0 35.31
## Chicago IL      1  1 0 41.32
## Houston TX      5  0 0 28.54
## Las Vegas NV    1  0 0 48.57
## Los Angeles CA  1  0 0 12.30
## Miami FL        1  1 0 24.73
## New York NY     4  2 0 28.20

Sigma-squared for all cities is updated:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily maximum wind speed vs. long-term 21-day mean for day of year)"
         )

Sigma-squared for all cities is updated, incuding the baseline:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    bind_rows(windDaily_vs_r21 %>% 
                  group_by(fct_city) %>% 
                  summarize(value=var(maxWind)) %>% 
                  mutate(ARIMA="base") %>% 
                  pivot_wider(id_cols="ARIMA", names_from="fct_city")
              ) %>%
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily maximum wind speed vs. long-term 21-day mean for day of year)",
         caption="base=daily max wind speed (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
    )

Daily precipitation data are converted to deviation from long-term rolling 21-day mean by day of year:

precipDaily_vs_r21 <- allCityDaily %>% 
    mutate(doy=pmin(yday(date), 365), 
           year=year(date), 
           sumPrecip=precipitation_sum, 
           fct_city=factor(cityName)
           ) %>% 
    select(fct_city, date, doy, year, sumPrecip) %>%
    left_join(allCityDaily %>% 
                  mutate(doy=pmin(yday(date),365), 
                         sumPrecip=precipitation_sum, 
                         fct_city=factor(cityName)
                         ) %>% 
                  arrange(date) %>%
                  group_by(fct_city) %>% 
                  helperRollingAgg(origVar="sumPrecip", newName="precip_r21", k=21) %>% 
                  group_by(fct_city, doy) %>% 
                  summarize(precip_r21=mean(precip_r21, na.rm=TRUE), .groups="drop"), 
              by=c("fct_city", "doy")
              ) %>% 
    mutate(delta_r21=sumPrecip-precip_r21)

precipDaily_vs_r21
## # A tibble: 53,855 × 7
##    fct_city  date         doy  year sumPrecip precip_r21 delta_r21
##    <fct>     <date>     <dbl> <dbl>     <dbl>      <dbl>     <dbl>
##  1 Boston MA 2010-01-01     1  2010       0.5       3.17     -2.67
##  2 Boston MA 2010-01-02     2  2010       8.9       3.42      5.48
##  3 Boston MA 2010-01-03     3  2010       6.7       3.22      3.48
##  4 Boston MA 2010-01-04     4  2010       0         3.19     -3.19
##  5 Boston MA 2010-01-05     5  2010       0         2.91     -2.91
##  6 Boston MA 2010-01-06     6  2010       0         3.02     -3.02
##  7 Boston MA 2010-01-07     7  2010       0         2.89     -2.89
##  8 Boston MA 2010-01-08     8  2010       0.2       3.08     -2.88
##  9 Boston MA 2010-01-09     9  2010       0         2.79     -2.79
## 10 Boston MA 2010-01-10    10  2010       0         2.93     -2.93
## # ℹ 53,845 more rows
summary(precipDaily_vs_r21)
##            fct_city          date                 doy             year     
##  Boston MA     : 5113   Min.   :1960-01-01   Min.   :  1.0   Min.   :1960  
##  Chicago IL    :23376   1st Qu.:1996-11-10   1st Qu.: 91.0   1st Qu.:1996  
##  Houston TX    : 5113   Median :2013-05-22   Median :183.0   Median :2013  
##  Las Vegas NV  : 5113   Mean   :2006-02-14   Mean   :182.8   Mean   :2006  
##  Los Angeles CA: 5113   3rd Qu.:2018-08-28   3rd Qu.:274.0   3rd Qu.:2018  
##  Miami FL      : 5113   Max.   :2023-12-31   Max.   :365.0   Max.   :2023  
##  New York NY   : 4914                                                      
##    sumPrecip         precip_r21        delta_r21         
##  Min.   :  0.000   Min.   :0.01599   Min.   : -7.334694  
##  1st Qu.:  0.000   1st Qu.:2.01905   1st Qu.: -2.958503  
##  Median :  0.000   Median :2.85204   Median : -1.936979  
##  Mean   :  2.601   Mean   :2.60069   Mean   : -0.000175  
##  3rd Qu.:  1.800   3rd Qu.:3.26094   3rd Qu.: -0.162245  
##  Max.   :143.900   Max.   :7.33469   Max.   :138.164626  
## 

Lag/lead for precipitation vs. 21-day mean is plotted:

precipDaily_vs_r21 %>% 
    arrange(fct_city, date) %>% 
    group_by(fct_city) %>% 
    mutate(precipNext=lead(delta_r21)) %>% 
    ungroup() %>% 
    filter(complete.cases(.)) %>% 
    count(fct_city, delta_r21=round(delta_r21), precipNext=round(precipNext)) %>% 
    ggplot(aes(x=delta_r21, y=precipNext)) + 
    geom_point(aes(size=n), alpha=0.1) + 
    facet_wrap(~fct_city) + 
    geom_smooth(aes(weight=n), method="lm") + 
    annotate("point", x=0, y=0, color="red") + 
    geom_abline(intercept=0, slope=1, color="red", lty=2) + 
    labs(x="Precipitation vs. long-term 21-day mean", 
         y="Next day's precipitation vs. long-term 21-day mean", 
         title="Precipitation vs. long-term 21-day mean by day of year and city"
         ) + 
    scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'

A baseline ARIMA model is explored, starting with c(0, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) precipDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
               se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       intercept     se sigma2 sigma
##   <chr>              <dbl>  <dbl>  <dbl> <dbl>
## 1 Las Vegas NV   -0.00184  0.0263   3.53  1.88
## 2 Los Angeles CA  0.00116  0.0722  26.7   5.16
## 3 Chicago IL      0.000684 0.0398  37.1   6.09
## 4 New York NY    -0.00681  0.101   49.9   7.06
## 5 Boston MA      -0.000555 0.104   54.9   7.41
## 6 Miami FL        0.00843  0.108   60.1   7.75
## 7 Houston TX     -0.00562  0.140  101.   10.0

The ARIMA model is explored at c(1, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) precipDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ar1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Las Vegas NV     3.36  1.83 0.224 -0.00184 
## 2 Los Angeles CA  24.6   4.96 0.275  0.00118 
## 3 Chicago IL      36.3   6.03 0.145  0.000685
## 4 New York NY     49.1   7.01 0.126 -0.00680 
## 5 Miami FL        54.1   7.36 0.315  0.00842 
## 6 Boston MA       54.3   7.37 0.102 -0.000531
## 7 Houston TX      89.3   9.45 0.336 -0.00560

The ARIMA model is explored at c(0, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) precipDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ma1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Las Vegas NV     3.37  1.84 0.205 -0.00185 
## 2 Los Angeles CA  24.8   4.98 0.255  0.00113 
## 3 Chicago IL      36.3   6.03 0.149  0.000684
## 4 New York NY     49.0   7.00 0.144 -0.00682 
## 5 Boston MA       54.2   7.36 0.115 -0.000567
## 6 Miami FL        54.4   7.38 0.304  0.00843 
## 7 Houston TX      90.6   9.52 0.300 -0.00565

The ARIMA model is explored at c(1, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) precipDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma     ar1     ma1 intercept
##   <chr>           <dbl> <dbl>   <dbl>   <dbl>     <dbl>
## 1 Las Vegas NV     3.36  1.83  0.315  -0.0952 -0.00173 
## 2 Los Angeles CA  24.6   4.96  0.376  -0.110   0.00170 
## 3 Chicago IL      36.3   6.03 -0.0159  0.164   0.000715
## 4 New York NY     48.9   6.99 -0.288   0.426  -0.00709 
## 5 Boston MA       54.1   7.36 -0.343   0.454  -0.000277
## 6 Miami FL        54.1   7.36  0.261   0.0599  0.00833 
## 7 Houston TX      89.2   9.45  0.398  -0.0702 -0.00529

The ARIMA model is explored at c(2, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) precipDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(2, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma   ar1     ar2 intercept
##   <chr>           <dbl> <dbl> <dbl>   <dbl>     <dbl>
## 1 Las Vegas NV     3.36  1.83 0.219  0.0234 -0.00183 
## 2 Los Angeles CA  24.6   4.96 0.270  0.0209  0.00122 
## 3 Chicago IL      36.3   6.03 0.148 -0.0217  0.000685
## 4 New York NY     48.8   6.99 0.136 -0.0719 -0.00678 
## 5 Boston MA       54.1   7.35 0.109 -0.0632 -0.000604
## 6 Miami FL        54.1   7.36 0.319 -0.0141  0.00845 
## 7 Houston TX      89.2   9.45 0.328  0.0218 -0.00552

Sigma-squared for all cities is compiled across models:

pdqList <- list("c(0, 0, 0)"=c(0, 0, 0), 
                "c(1, 0, 0)"=c(1, 0, 0), 
                "c(0, 0, 1)"=c(0, 0, 1), 
                "c(1, 0, 1)"=c(1, 0, 1), 
                "c(2, 0, 0)"=c(2, 0, 0)
                )

pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities, 
                                                   FUN=function(x) { 
                                                       tmp<- precipDaily_vs_r21 %>% 
                                                           arrange(date) %>%
                                                           filter(fct_city==x) %>% 
                                                           pull(delta_r21) %>% 
                                                           arima(x=., order=unname(pdq)) 
                                                       tmp$sigma2
                                                       }
                                                   )
                  ) %>%
    t() %>%
    as.data.frame() %>%
    rownames_to_column("ARIMA") %>%
    as_tibble()

pdqAll
## # A tibble: 5 × 8
##   ARIMA    `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
##   <chr>          <dbl>        <dbl>        <dbl>          <dbl>            <dbl>
## 1 c(0, 0,…        54.8         37.1        101.            3.53             26.6
## 2 c(1, 0,…        54.3         36.3         89.3           3.36             24.6
## 3 c(0, 0,…        54.2         36.3         90.6           3.37             24.8
## 4 c(1, 0,…        54.1         36.3         89.2           3.35             24.6
## 5 c(2, 0,…        54.1         36.3         89.2           3.35             24.6
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>

Sigma-squared for all cities is plotted:

pdqAll %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily precipitation vs. long-term 21-day mean for day of year)"
         )

The auto.arima() process is run to estimate optimal parameters for all cities:

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) precipDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       forecast::auto.arima(d=0)
                   )
names(tstARIMA) <- dewCities

The auto.arima() parameters are pulled for all cities:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t()
##                ar ma d  sig2
## Boston MA       0  2 0 54.08
## Chicago IL      0  1 0 36.32
## Houston TX      1  1 0 89.24
## Las Vegas NV    0  2 0  3.36
## Los Angeles CA  1  2 0 24.55
## Miami FL        1  2 0 54.07
## New York NY     1  2 0 48.84

Sigma-squared for all cities is updated:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily precipitation vs. long-term 21-day mean for day of year)"
         )

Sigma-squared for all cities is updated, incuding the baseline:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    bind_rows(precipDaily_vs_r21 %>% 
                  group_by(fct_city) %>% 
                  summarize(value=var(sumPrecip)) %>% 
                  mutate(ARIMA="base") %>% 
                  pivot_wider(id_cols="ARIMA", names_from="fct_city")
              ) %>%
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily precipitation vs. long-term 21-day mean for day of year)",
         caption="base=daily precipitation (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
    )

Daily shortwave radiation data are converted to deviation from long-term rolling 21-day mean by day of year:

swrDaily_vs_r21 <- allCityDaily %>% 
    mutate(doy=pmin(yday(date), 365), 
           year=year(date), 
           sumSWR=shortwave_radiation_sum, 
           fct_city=factor(cityName)
           ) %>% 
    select(fct_city, date, doy, year, sumSWR) %>%
    left_join(allCityDaily %>% 
                  mutate(doy=pmin(yday(date),365), 
                         sumSWR=shortwave_radiation_sum, 
                         fct_city=factor(cityName)
                         ) %>% 
                  arrange(date) %>%
                  group_by(fct_city) %>% 
                  helperRollingAgg(origVar="sumSWR", newName="swr_r21", k=21) %>% 
                  group_by(fct_city, doy) %>% 
                  summarize(swr_r21=mean(swr_r21, na.rm=TRUE), .groups="drop"), 
              by=c("fct_city", "doy")
              ) %>% 
    mutate(delta_r21=sumSWR-swr_r21)

swrDaily_vs_r21
## # A tibble: 53,855 × 7
##    fct_city  date         doy  year sumSWR swr_r21 delta_r21
##    <fct>     <date>     <dbl> <dbl>  <dbl>   <dbl>     <dbl>
##  1 Boston MA 2010-01-01     1  2010   6.58    6.01    0.565 
##  2 Boston MA 2010-01-02     2  2010   3.86    6.03   -2.17  
##  3 Boston MA 2010-01-03     3  2010   4.04    6.13   -2.09  
##  4 Boston MA 2010-01-04     4  2010   7.66    6.12    1.54  
##  5 Boston MA 2010-01-05     5  2010   7.55    6.18    1.37  
##  6 Boston MA 2010-01-06     6  2010   8.26    6.18    2.08  
##  7 Boston MA 2010-01-07     7  2010   8.82    6.29    2.53  
##  8 Boston MA 2010-01-08     8  2010   6.22    6.31   -0.0854
##  9 Boston MA 2010-01-09     9  2010   8.85    6.39    2.46  
## 10 Boston MA 2010-01-10    10  2010   9.24    6.45    2.79  
## # ℹ 53,845 more rows
summary(swrDaily_vs_r21)
##            fct_city          date                 doy             year     
##  Boston MA     : 5113   Min.   :1960-01-01   Min.   :  1.0   Min.   :1960  
##  Chicago IL    :23376   1st Qu.:1996-11-10   1st Qu.: 91.0   1st Qu.:1996  
##  Houston TX    : 5113   Median :2013-05-22   Median :183.0   Median :2013  
##  Las Vegas NV  : 5113   Mean   :2006-02-14   Mean   :182.8   Mean   :2006  
##  Los Angeles CA: 5113   3rd Qu.:2018-08-28   3rd Qu.:274.0   3rd Qu.:2018  
##  Miami FL      : 5113   Max.   :2023-12-31   Max.   :365.0   Max.   :2023  
##  New York NY   : 4914                                                      
##      sumSWR         swr_r21         delta_r21         
##  Min.   : 0.40   Min.   : 5.561   Min.   :-2.321e+01  
##  1st Qu.: 9.99   1st Qu.:10.960   1st Qu.:-2.274e+00  
##  Median :16.20   Median :16.771   Median : 9.564e-01  
##  Mean   :16.31   Mean   :16.310   Mean   : 7.834e-04  
##  3rd Qu.:22.75   3rd Qu.:21.479   3rd Qu.: 2.923e+00  
##  Max.   :33.04   Max.   :30.328   Max.   : 1.053e+01  
## 

Lag/lead for shortwave radiation vs. 21-day mean is plotted:

swrDaily_vs_r21 %>% 
    arrange(fct_city, date) %>% 
    group_by(fct_city) %>% 
    mutate(swrNext=lead(delta_r21)) %>% 
    ungroup() %>% 
    filter(complete.cases(.)) %>% 
    count(fct_city, delta_r21=round(delta_r21), swrNext=round(swrNext)) %>% 
    ggplot(aes(x=delta_r21, y=swrNext)) + 
    geom_point(aes(size=n), alpha=0.1) + 
    facet_wrap(~fct_city) + 
    geom_smooth(aes(weight=n), method="lm") + 
    annotate("point", x=0, y=0, color="red") + 
    geom_abline(intercept=0, slope=1, color="red", lty=2) + 
    labs(x="Shortwave radiation vs. long-term 21-day mean", 
         y="Next day's shortwave radiation vs. long-term 21-day mean", 
         title="Shortwave radiation vs. long-term 21-day mean by day of year and city"
         ) + 
    scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'

A baseline ARIMA model is explored, starting with c(0, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) swrDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
               se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       intercept     se sigma2 sigma
##   <chr>              <dbl>  <dbl>  <dbl> <dbl>
## 1 Las Vegas NV    0.00123  0.0367   6.88  2.62
## 2 Los Angeles CA -0.00246  0.0466  11.1   3.33
## 3 Miami FL       -0.00172  0.0509  13.2   3.64
## 4 Chicago IL     -0.000810 0.0301  21.1   4.60
## 5 Houston TX      0.00746  0.0681  23.7   4.87
## 6 Boston MA       0.000330 0.0739  27.9   5.28
## 7 New York NY     0.00740  0.0772  29.3   5.41

The ARIMA model is explored at c(1, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) swrDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ar1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Las Vegas NV     5.62  2.37 0.429  0.00123 
## 2 Los Angeles CA   8.58  2.93 0.476 -0.00247 
## 3 Miami FL        10.7   3.27 0.440 -0.00170 
## 4 Houston TX      18.8   4.33 0.456  0.00746 
## 5 Chicago IL      19.8   4.45 0.251 -0.000811
## 6 Boston MA       26.5   5.15 0.228  0.000330
## 7 New York NY     27.7   5.26 0.238  0.00741

The ARIMA model is explored at c(0, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) swrDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(0, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 5
##   cityName       sigma2 sigma   ma1 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>
## 1 Las Vegas NV     5.85  2.42 0.360  0.00123 
## 2 Los Angeles CA   8.98  3.00 0.414 -0.00249 
## 3 Miami FL        11.0   3.32 0.382 -0.00171 
## 4 Houston TX      19.3   4.39 0.415  0.00747 
## 5 Chicago IL      19.9   4.46 0.245 -0.000811
## 6 Boston MA       26.5   5.14 0.232  0.000342
## 7 New York NY     27.6   5.26 0.240  0.00740

The ARIMA model is explored at c(1, 0, 1):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) swrDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(1, 0, 1))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma    ar1       ma1 intercept
##   <chr>           <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 Las Vegas NV     5.60  2.37 0.546  -0.145     0.00119 
## 2 Los Angeles CA   8.57  2.93 0.524  -0.0630   -0.00249 
## 3 Miami FL        10.7   3.27 0.440  -0.000566 -0.00150 
## 4 Houston TX      18.7   4.33 0.383   0.0913    0.00745 
## 5 Chicago IL      19.8   4.45 0.202   0.0528   -0.000826
## 6 Boston MA       26.4   5.14 0.0834  0.153     0.000308
## 7 New York NY     27.6   5.26 0.104   0.142     0.00755

The ARIMA model is explored at c(2, 0, 0):

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) swrDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       arima(order=c(2, 0, 0))
                   )
names(tstARIMA) <- dewCities

tibble::tibble(cityName=names(tstARIMA), 
               sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
               sigma=sqrt(sigma2)
               ) %>%
    inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
                   t() %>%
                   as.data.frame() %>%
                   rownames_to_column("cityName") %>%
                   as_tibble(), 
               by="cityName"
    ) %>%
    arrange(sigma)
## # A tibble: 7 × 6
##   cityName       sigma2 sigma   ar1       ar2 intercept
##   <chr>           <dbl> <dbl> <dbl>     <dbl>     <dbl>
## 1 Las Vegas NV     5.61  2.37 0.408  0.0491    0.00120 
## 2 Los Angeles CA   8.57  2.93 0.465  0.0215   -0.00251 
## 3 Miami FL        10.7   3.27 0.439  0.000340 -0.00174 
## 4 Houston TX      18.8   4.33 0.473 -0.0374    0.00740 
## 5 Chicago IL      19.8   4.45 0.254 -0.0120   -0.000809
## 6 Boston MA       26.4   5.14 0.236 -0.0344    0.000367
## 7 New York NY     27.6   5.26 0.246 -0.0338    0.00741

Sigma-squared for all cities is compiled across models:

pdqList <- list("c(0, 0, 0)"=c(0, 0, 0), 
                "c(1, 0, 0)"=c(1, 0, 0), 
                "c(0, 0, 1)"=c(0, 0, 1), 
                "c(1, 0, 1)"=c(1, 0, 1), 
                "c(2, 0, 0)"=c(2, 0, 0)
                )

pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities, 
                                                   FUN=function(x) { 
                                                       tmp<- swrDaily_vs_r21 %>% 
                                                           arrange(date) %>%
                                                           filter(fct_city==x) %>% 
                                                           pull(delta_r21) %>% 
                                                           arima(x=., order=unname(pdq)) 
                                                       tmp$sigma2
                                                       }
                                                   )
                  ) %>%
    t() %>%
    as.data.frame() %>%
    rownames_to_column("ARIMA") %>%
    as_tibble()

pdqAll
## # A tibble: 5 × 8
##   ARIMA    `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
##   <chr>          <dbl>        <dbl>        <dbl>          <dbl>            <dbl>
## 1 c(0, 0,…        27.9         21.1         23.7           6.88            11.1 
## 2 c(1, 0,…        26.5         19.8         18.8           5.62             8.58
## 3 c(0, 0,…        26.4         19.9         19.3           5.85             8.97
## 4 c(1, 0,…        26.4         19.8         18.7           5.60             8.57
## 5 c(2, 0,…        26.4         19.8         18.7           5.60             8.57
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>

Sigma-squared for all cities is plotted:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily shortwave radiation vs. long-term 21-day mean for day of year)"
         )

The auto.arima() process is run to estimate optimal parameters for all cities:

tstARIMA <- lapply(dewCities, 
                   FUN=function(x) swrDaily_vs_r21 %>% 
                       arrange(date) %>%
                       filter(fct_city==x) %>% 
                       pull(delta_r21) %>% 
                       forecast::auto.arima(d=0)
                   )
names(tstARIMA) <- dewCities

The auto.arima() parameters are pulled for all cities:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t()
##                ar ma d  sig2
## Boston MA       0  1 0 26.45
## Chicago IL      0  3 0 19.81
## Houston TX      1  3 0 18.60
## Las Vegas NV    2  1 0  5.51
## Los Angeles CA  3  1 0  8.47
## Miami FL        1  0 0 10.68
## New York NY     0  2 0 27.63

Sigma-squared for all cities is updated:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily shortwave radiation vs. long-term 21-day mean for day of year)"
         )

Sigma-squared for all cities is updated, incuding the baseline:

sapply(tstARIMA, 
       FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
       ) %>%
    t() %>% 
    as.data.frame() %>% 
    rownames_to_column("name") %>% 
    select(name, value=sig2) %>% 
    mutate(ARIMA="auto") %>% 
    pivot_wider(id_cols="ARIMA") %>% 
    bind_rows(pdqAll) %>% 
    bind_rows(swrDaily_vs_r21 %>% 
                  group_by(fct_city) %>% 
                  summarize(value=var(sumSWR)) %>% 
                  mutate(ARIMA="base") %>% 
                  pivot_wider(id_cols="ARIMA", names_from="fct_city")
              ) %>%
    pivot_longer(cols=-ARIMA, names_to="city") %>% 
    group_by(city) %>% 
    mutate(pct=value/max(value)) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) + 
    facet_wrap(~city) + 
    geom_col() + 
    theme(legend.position = "none") + 
    geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) + 
    labs(x="ARIMA(p, d, q)", 
         y="Sigma-squared\nrelative to ARIMA(0, 0, 0)", 
         title="Relative sigma-squared by model", 
         subtitle="(daily shortwave radiation vs. long-term 21-day mean for day of year)",
         caption="base=daily SWR (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
    )

ACF is calculated for shortwave radiation for each city:

dewCities <- dewPDaily %>% 
    pull(fct_city) %>% 
    as.vector() %>% 
    unique()

swrACF <- lapply(dewCities, 
                 FUN=function(x) {
                     swrDaily_vs_r21 %>% 
                         filter(fct_city==x) %>% 
                         arrange(date) %>% 
                         select(sumSWR) %>% 
                         acf(lag.max=2000, plot=FALSE)
                     }
                 )
names(swrACF) <- dewCities

swrACFAll <- map_dfr(.x=dewCities, 
                     .f=function(x) tibble::tibble(lag=as.vector(swrACF[[x]][["lag"]]), 
                                                   acf=as.vector(swrACF[[x]][["acf"]])
                                                   ), 
                     .id="src") %>% 
    mutate(cityName=dewCities[as.integer(src)])

swrACFAll
## # A tibble: 14,007 × 4
##    src     lag   acf cityName 
##    <chr> <dbl> <dbl> <chr>    
##  1 1         0 1     Boston MA
##  2 1         1 0.640 Boston MA
##  3 1         2 0.542 Boston MA
##  4 1         3 0.533 Boston MA
##  5 1         4 0.527 Boston MA
##  6 1         5 0.524 Boston MA
##  7 1         6 0.521 Boston MA
##  8 1         7 0.530 Boston MA
##  9 1         8 0.518 Boston MA
## 10 1         9 0.514 Boston MA
## # ℹ 13,997 more rows
swrACFAll %>%
    filter(lag > 0) %>%
    group_by(cityName) %>%
    summarize(mu_abs_acf=mean(abs(acf))) %>%
    arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Las Vegas NV        0.437
## 2 Chicago IL          0.399
## 3 Los Angeles CA      0.397
## 4 Boston MA           0.270
## 5 New York NY         0.246
## 6 Miami FL            0.243
## 7 Houston TX          0.229

ACF by city are plotted:

swrACFAll %>% 
    filter(lag>0) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="ACF", title="ACF of shortwave radiation by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

PACF is calculated for shortwave radiation for each city:

swrPACF <- lapply(dewCities, 
                  FUN=function(x) {
                      swrDaily_vs_r21 %>% 
                          filter(fct_city==x) %>% 
                          arrange(date) %>% 
                          select(sumSWR) %>% 
                          pacf(lag.max=2000, plot=FALSE)
                      }
                  )
names(swrPACF) <- dewCities

swrPACFAll <- map_dfr(.x=dewCities, 
                      .f=function(x) tibble::tibble(lag=as.vector(swrPACF[[x]][["lag"]]), 
                                                    pacf=as.vector(swrPACF[[x]][["acf"]])
                                                    ), 
                      .id="src") %>% 
    mutate(cityName=dewCities[as.integer(src)])

swrPACFAll
## # A tibble: 14,000 × 4
##    src     lag   pacf cityName 
##    <chr> <dbl>  <dbl> <chr>    
##  1 1         1 0.640  Boston MA
##  2 1         2 0.225  Boston MA
##  3 1         3 0.214  Boston MA
##  4 1         4 0.162  Boston MA
##  5 1         5 0.140  Boston MA
##  6 1         6 0.118  Boston MA
##  7 1         7 0.127  Boston MA
##  8 1         8 0.0780 Boston MA
##  9 1         9 0.0802 Boston MA
## 10 1        10 0.107  Boston MA
## # ℹ 13,990 more rows
swrPACFAll %>%
    filter(lag > 0) %>%
    group_by(cityName) %>%
    summarize(mu_abs_pacf=mean(abs(pacf))) %>%
    arrange(desc(mu_abs_pacf))
## # A tibble: 7 × 2
##   cityName       mu_abs_pacf
##   <chr>                <dbl>
## 1 Las Vegas NV       0.0114 
## 2 Boston MA          0.0114 
## 3 Houston TX         0.0113 
## 4 New York NY        0.0112 
## 5 Los Angeles CA     0.0111 
## 6 Miami FL           0.0109 
## 7 Chicago IL         0.00729

PACF by city are plotted:

swrPACFAll %>% 
    filter(lag>0) %>%
    ggplot(aes(x=lag, y=pacf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="PACF", title="PACF of shortwave radiation by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

swrPACFAll %>% 
    filter(lag>0, lag<=50) %>%
    ggplot(aes(x=lag, y=pacf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="PACF", title="PACF of shortwave radiation by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

ACF is calculated and plotted for seasonally-adjusted shortwave radiation for each city:

swrACF_r21 <- lapply(dewCities, 
                     FUN=function(x) {
                         swrDaily_vs_r21 %>% 
                             filter(fct_city==x) %>% 
                             arrange(date) %>% 
                             select(delta_r21) %>% 
                             acf(lag.max=2000, plot=FALSE)
                         }
                     )
names(swrACF_r21) <- dewCities

swrACFAll_r21 <- map_dfr(.x=dewCities, 
                         .f=function(x) tibble::tibble(lag=as.vector(swrACF_r21[[x]][["lag"]]), 
                                                       acf=as.vector(swrACF_r21[[x]][["acf"]])
                                                       ), 
                         .id="src") %>% 
    mutate(cityName=dewCities[as.integer(src)])

swrACFAll_r21
## # A tibble: 14,007 × 4
##    src     lag       acf cityName 
##    <chr> <dbl>     <dbl> <chr>    
##  1 1         0  1        Boston MA
##  2 1         1  0.228    Boston MA
##  3 1         2  0.0193   Boston MA
##  4 1         3  0.000369 Boston MA
##  5 1         4 -0.0106   Boston MA
##  6 1         5 -0.0151   Boston MA
##  7 1         6 -0.0192   Boston MA
##  8 1         7  0.00412  Boston MA
##  9 1         8 -0.0189   Boston MA
## 10 1         9 -0.0232   Boston MA
## # ℹ 13,997 more rows
swrACFAll_r21 %>%
    filter(lag > 0) %>%
    group_by(cityName) %>%
    summarize(mu_abs_acf=mean(abs(acf))) %>%
    arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Las Vegas NV      0.0196 
## 2 Los Angeles CA    0.0161 
## 3 Houston TX        0.0148 
## 4 Miami FL          0.0124 
## 5 Boston MA         0.0108 
## 6 New York NY       0.0107 
## 7 Chicago IL        0.00592
swrACFAll_r21 %>% 
    filter(lag>0) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="ACF", title="ACF of seasonally-adjusted shortwave radiation by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

swrACFAll_r21 %>% 
    filter(lag>0, lag<=30) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="ACF", title="ACF of seasonally-adjusted shortwave radiation by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

PACF is calculated and plotted for seasonally-adjusted shortwave radiation for each city:

swrPACF_r21 <- lapply(dewCities, 
                      FUN=function(x) {
                          swrDaily_vs_r21 %>% 
                              filter(fct_city==x) %>% 
                              arrange(date) %>% 
                              select(delta_r21) %>% 
                              pacf(lag.max=2000, plot=FALSE)
                          }
                     )
names(swrPACF_r21) <- dewCities

swrPACFAll_r21 <- map_dfr(.x=dewCities, 
                          .f=function(x) tibble::tibble(lag=as.vector(swrPACF_r21[[x]][["lag"]]), 
                                                        acf=as.vector(swrPACF_r21[[x]][["acf"]])
                                                        ), 
                          .id="src") %>% 
    mutate(cityName=dewCities[as.integer(src)])

swrPACFAll_r21
## # A tibble: 14,000 × 4
##    src     lag      acf cityName 
##    <chr> <dbl>    <dbl> <chr>    
##  1 1         1  0.228   Boston MA
##  2 1         2 -0.0344  Boston MA
##  3 1         3  0.00385 Boston MA
##  4 1         4 -0.0115  Boston MA
##  5 1         5 -0.0106  Boston MA
##  6 1         6 -0.0141  Boston MA
##  7 1         7  0.0122  Boston MA
##  8 1         8 -0.0243  Boston MA
##  9 1         9 -0.0143  Boston MA
## 10 1        10  0.0255  Boston MA
## # ℹ 13,990 more rows
swrPACFAll_r21 %>%
    filter(lag > 0) %>%
    group_by(cityName) %>%
    summarize(mu_abs_acf=mean(abs(acf))) %>%
    arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Houston TX        0.0105 
## 2 Las Vegas NV      0.0102 
## 3 Los Angeles CA    0.0101 
## 4 Miami FL          0.0100 
## 5 Boston MA         0.00993
## 6 New York NY       0.00989
## 7 Chicago IL        0.00515
swrPACFAll_r21 %>% 
    filter(lag>0) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="PACF", title="PACF of seasonally-adjusted shortwave radiation by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

swrPACFAll_r21 %>% 
    filter(lag>0, lag<=30) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="PACF", title="PACF of seasonally-adjusted shortwave radiation by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

ACF is calculated and plotted for seasonally-adjusted temperature for each city:

tempACF_r21 <- lapply(dewCities, 
                      FUN=function(x) {
                          tempDaily_vs_r21 %>% 
                              filter(fct_city==x) %>% 
                              arrange(date) %>% 
                              select(delta_r21) %>% 
                              acf(lag.max=2000, plot=FALSE)
                          }
                      )
names(tempACF_r21) <- dewCities

tempACFAll_r21 <- map_dfr(.x=dewCities, 
                          .f=function(x) tibble::tibble(lag=as.vector(tempACF_r21[[x]][["lag"]]), 
                                                        acf=as.vector(tempACF_r21[[x]][["acf"]])
                                                        ), 
                          .id="src") %>% 
    mutate(cityName=dewCities[as.integer(src)])

tempACFAll_r21
## # A tibble: 14,007 × 4
##    src     lag    acf cityName 
##    <chr> <dbl>  <dbl> <chr>    
##  1 1         0 1      Boston MA
##  2 1         1 0.648  Boston MA
##  3 1         2 0.307  Boston MA
##  4 1         3 0.187  Boston MA
##  5 1         4 0.150  Boston MA
##  6 1         5 0.124  Boston MA
##  7 1         6 0.109  Boston MA
##  8 1         7 0.0873 Boston MA
##  9 1         8 0.0746 Boston MA
## 10 1         9 0.0637 Boston MA
## # ℹ 13,997 more rows
tempACFAll_r21 %>%
    filter(lag > 0) %>%
    group_by(cityName) %>%
    summarize(mu_abs_acf=mean(abs(acf))) %>%
    arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Los Angeles CA     0.0282
## 2 Las Vegas NV       0.0232
## 3 Houston TX         0.0198
## 4 New York NY        0.0191
## 5 Boston MA          0.0174
## 6 Miami FL           0.0173
## 7 Chicago IL         0.0127
tempACFAll_r21 %>% 
    filter(lag>0) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="ACF", title="ACF of seasonally-adjusted mean temperature by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

tempACFAll_r21 %>% 
    filter(lag>0, lag<=30) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="ACF", title="ACF of seasonally-adjusted mean temperature by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

PACF is calculated and plotted for seasonally-adjusted temperature for each city:

tempPACF_r21 <- lapply(dewCities, 
                       FUN=function(x) {
                           tempDaily_vs_r21 %>% 
                               filter(fct_city==x) %>% 
                               arrange(date) %>% 
                               select(delta_r21) %>% 
                               pacf(lag.max=2000, plot=FALSE)
                           }
                       )
names(tempPACF_r21) <- dewCities

tempPACFAll_r21 <- map_dfr(.x=dewCities, 
                           .f=function(x) tibble::tibble(lag=as.vector(tempPACF_r21[[x]][["lag"]]), 
                                                         acf=as.vector(tempPACF_r21[[x]][["acf"]])
                                                         ), 
                           .id="src") %>% 
    mutate(cityName=dewCities[as.integer(src)])

tempPACFAll_r21
## # A tibble: 14,000 × 4
##    src     lag       acf cityName 
##    <chr> <dbl>     <dbl> <chr>    
##  1 1         1  0.648    Boston MA
##  2 1         2 -0.195    Boston MA
##  3 1         3  0.134    Boston MA
##  4 1         4  0.0104   Boston MA
##  5 1         5  0.0248   Boston MA
##  6 1         6  0.0303   Boston MA
##  7 1         7 -0.00303  Boston MA
##  8 1         8  0.0244   Boston MA
##  9 1         9  0.000595 Boston MA
## 10 1        10  0.0201   Boston MA
## # ℹ 13,990 more rows
tempPACFAll_r21 %>%
    filter(lag > 0) %>%
    group_by(cityName) %>%
    summarize(mu_abs_acf=mean(abs(acf))) %>%
    arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Los Angeles CA    0.0108 
## 2 New York NY       0.0107 
## 3 Boston MA         0.0105 
## 4 Las Vegas NV      0.0104 
## 5 Houston TX        0.0104 
## 6 Miami FL          0.0101 
## 7 Chicago IL        0.00577
tempPACFAll_r21 %>% 
    filter(lag>0) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="PACF", title="PACF of seasonally-adjusted mean temperature by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

tempPACFAll_r21 %>% 
    filter(lag>0, lag<=30) %>%
    ggplot(aes(x=lag, y=acf)) + 
    geom_line(aes(color=cityName, group=cityName), lwd=1) + 
    labs(x="Lag (zero excluded)", y="PACF", title="PACF of seasonally-adjusted mean temperature by city") + 
    scale_color_discrete(NULL) + 
    lims(y=c(-1, 1))

The ACF plot creation process is converted to functional form:

makeACFPlot <- function(df, 
                        cities=dewCities, 
                        keyVar="delta_r21", 
                        fn=acf, 
                        lagMax=2000, 
                        lagCuts=list(c(1, +Inf)), 
                        plotTitle="provided data",
                        printSummary=TRUE,
                        makePlots=TRUE,
                        returnData=TRUE
                        ) {

    # FUNCTIOn ARGUMENTS:
    # df: tibble containing data
    # cities: city names for including as colors in plot
    # fn: the function to calculate (should be acf or pacf)
    # lagMax: the maximum lage to use in fn
    # lagCuts: list of tuples for plot x-axis min/max cut points list(c(x1min, x1max), c(x2min, x2max), ...)
    # plotTitle: description for the plot data, will have "fn of " prepended
    # printSummary: boolean, should a summary of mean absolute fn by city be printed?
    # makePlots: boolean, should plots be created?
    # returnData: bollean, should dfACF be returned?
    
    vecACF <- lapply(cities, 
                     FUN=function(x) {
                         df %>% 
                             filter(fct_city==x) %>% 
                             arrange(date) %>% 
                             select(all_of(keyVar)) %>% 
                             fn(lag.max=lagMax, plot=FALSE)
                         }
                     )

    names(vecACF) <- cities

    dfACF <- map_dfr(.x=cities, 
                     .f=function(x) tibble::tibble(lag=as.vector(vecACF[[x]][["lag"]]), 
                                                   acf=as.vector(vecACF[[x]][["acf"]])
                                                   ), 
                     .id="src"
                     ) %>% 
        mutate(cityName=cities[as.integer(src)])

    if(isTRUE(printSummary)) {
        dfACF %>%
            filter(lag > 0) %>%
            group_by(cityName) %>%
            summarize(mu_abs_acf=mean(abs(acf))) %>%
            arrange(desc(mu_abs_acf)) %>%
            print()
    }
    
    if(isTRUE(makePlots)) {
        for(i in 1:length(lagCuts)) {
            xmin <- lagCuts[[i]][1]
            xmax <- lagCuts[[i]][2]
            p1 <- dfACF %>%
                filter(lag>=xmin, lag<=xmax) %>%
                ggplot(aes(x=lag, y=acf)) +
                geom_line(aes(color=cityName, group=cityName), lwd=1) +
                labs(x="Lag", 
                     y=toupper(deparse(substitute(fn))), 
                     title=paste0(toupper(deparse(substitute(fn))), " of ", plotTitle)
                     ) +
                scale_color_discrete(NULL) +
                lims(y=c(-1, 1))
            print(p1)
        }
    }
    
    if(isTRUE(returnData)) return(dfACF)
        
}

The function is tested:

# Base functionality
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21")
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Los Angeles CA     0.0282
## 2 Las Vegas NV       0.0232
## 3 Houston TX         0.0198
## 4 New York NY        0.0191
## 5 Boston MA          0.0174
## 6 Miami FL           0.0173
## 7 Chicago IL         0.0127

## # A tibble: 14,007 × 4
##    src     lag    acf cityName 
##    <chr> <dbl>  <dbl> <chr>    
##  1 1         0 1      Boston MA
##  2 1         1 0.648  Boston MA
##  3 1         2 0.307  Boston MA
##  4 1         3 0.187  Boston MA
##  5 1         4 0.150  Boston MA
##  6 1         5 0.124  Boston MA
##  7 1         6 0.109  Boston MA
##  8 1         7 0.0873 Boston MA
##  9 1         8 0.0746 Boston MA
## 10 1         9 0.0637 Boston MA
## # ℹ 13,997 more rows
# PACF functionality
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", fn=pacf)
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Los Angeles CA    0.0108 
## 2 New York NY       0.0107 
## 3 Boston MA         0.0105 
## 4 Las Vegas NV      0.0104 
## 5 Houston TX        0.0104 
## 6 Miami FL          0.0101 
## 7 Chicago IL        0.00577

## # A tibble: 14,000 × 4
##    src     lag       acf cityName 
##    <chr> <dbl>     <dbl> <chr>    
##  1 1         1  0.648    Boston MA
##  2 1         2 -0.195    Boston MA
##  3 1         3  0.134    Boston MA
##  4 1         4  0.0104   Boston MA
##  5 1         5  0.0248   Boston MA
##  6 1         6  0.0303   Boston MA
##  7 1         7 -0.00303  Boston MA
##  8 1         8  0.0244   Boston MA
##  9 1         9  0.000595 Boston MA
## 10 1        10  0.0201   Boston MA
## # ℹ 13,990 more rows
# Exclude elements of printing/plotting
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", printSummary=FALSE)

## # A tibble: 14,007 × 4
##    src     lag    acf cityName 
##    <chr> <dbl>  <dbl> <chr>    
##  1 1         0 1      Boston MA
##  2 1         1 0.648  Boston MA
##  3 1         2 0.307  Boston MA
##  4 1         3 0.187  Boston MA
##  5 1         4 0.150  Boston MA
##  6 1         5 0.124  Boston MA
##  7 1         6 0.109  Boston MA
##  8 1         7 0.0873 Boston MA
##  9 1         8 0.0746 Boston MA
## 10 1         9 0.0637 Boston MA
## # ℹ 13,997 more rows
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", returnData=FALSE)
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Los Angeles CA     0.0282
## 2 Las Vegas NV       0.0232
## 3 Houston TX         0.0198
## 4 New York NY        0.0191
## 5 Boston MA          0.0174
## 6 Miami FL           0.0173
## 7 Chicago IL         0.0127

makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", makePlots=FALSE)
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Los Angeles CA     0.0282
## 2 Las Vegas NV       0.0232
## 3 Houston TX         0.0198
## 4 New York NY        0.0191
## 5 Boston MA          0.0174
## 6 Miami FL           0.0173
## 7 Chicago IL         0.0127
## # A tibble: 14,007 × 4
##    src     lag    acf cityName 
##    <chr> <dbl>  <dbl> <chr>    
##  1 1         0 1      Boston MA
##  2 1         1 0.648  Boston MA
##  3 1         2 0.307  Boston MA
##  4 1         3 0.187  Boston MA
##  5 1         4 0.150  Boston MA
##  6 1         5 0.124  Boston MA
##  7 1         6 0.109  Boston MA
##  8 1         7 0.0873 Boston MA
##  9 1         8 0.0746 Boston MA
## 10 1         9 0.0637 Boston MA
## # ℹ 13,997 more rows
# Specialized cut points
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", lagCuts=list(c(0, +Inf), c(1, 10)))
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Los Angeles CA     0.0282
## 2 Las Vegas NV       0.0232
## 3 Houston TX         0.0198
## 4 New York NY        0.0191
## 5 Boston MA          0.0174
## 6 Miami FL           0.0173
## 7 Chicago IL         0.0127

## # A tibble: 14,007 × 4
##    src     lag    acf cityName 
##    <chr> <dbl>  <dbl> <chr>    
##  1 1         0 1      Boston MA
##  2 1         1 0.648  Boston MA
##  3 1         2 0.307  Boston MA
##  4 1         3 0.187  Boston MA
##  5 1         4 0.150  Boston MA
##  6 1         5 0.124  Boston MA
##  7 1         6 0.109  Boston MA
##  8 1         7 0.0873 Boston MA
##  9 1         8 0.0746 Boston MA
## 10 1         9 0.0637 Boston MA
## # ℹ 13,997 more rows
# Customized plot titles
makeACFPlot(tempDaily_vs_r21, 
            keyVar="delta_r21", 
            lagCuts=list(c(0, +Inf), c(1, 10)), 
            plotTitle="seasonally-adjusted mean temperature by city"
            )
## # A tibble: 7 × 2
##   cityName       mu_abs_acf
##   <chr>               <dbl>
## 1 Los Angeles CA     0.0282
## 2 Las Vegas NV       0.0232
## 3 Houston TX         0.0198
## 4 New York NY        0.0191
## 5 Boston MA          0.0174
## 6 Miami FL           0.0173
## 7 Chicago IL         0.0127

## # A tibble: 14,007 × 4
##    src     lag    acf cityName 
##    <chr> <dbl>  <dbl> <chr>    
##  1 1         0 1      Boston MA
##  2 1         1 0.648  Boston MA
##  3 1         2 0.307  Boston MA
##  4 1         3 0.187  Boston MA
##  5 1         4 0.150  Boston MA
##  6 1         5 0.124  Boston MA
##  7 1         6 0.109  Boston MA
##  8 1         7 0.0873 Boston MA
##  9 1         8 0.0746 Boston MA
## 10 1         9 0.0637 Boston MA
## # ℹ 13,997 more rows
# Create tibble for multiple elements
map_dfr(.x=c("delta_r21", "temp_r21", "meanTemp"), 
        .f=function(x) makeACFPlot(tempDaily_vs_r21, keyVar=x, printSummary=FALSE, makePlots=FALSE), 
        .id="src"
        ) %>%
    mutate(src=c("delta_r21", "temp_r21", "meanTemp")[as.integer(src)]) %>%
    pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
## # A tibble: 14,007 × 5
##      lag cityName  delta_r21 temp_r21 meanTemp
##    <dbl> <chr>         <dbl>    <dbl>    <dbl>
##  1     0 Boston MA    1         1        1    
##  2     1 Boston MA    0.648     1.000    0.940
##  3     2 Boston MA    0.307     0.999    0.882
##  4     3 Boston MA    0.187     0.998    0.861
##  5     4 Boston MA    0.150     0.996    0.854
##  6     5 Boston MA    0.124     0.994    0.848
##  7     6 Boston MA    0.109     0.992    0.844
##  8     7 Boston MA    0.0873    0.990    0.839
##  9     8 Boston MA    0.0746    0.987    0.834
## 10     9 Boston MA    0.0637    0.985    0.830
## # ℹ 13,997 more rows

Relative ACF by city for mean temperature are plotted:

# Create tibble for multiple elements
dfTempACF <- map_dfr(.x=c("delta_r21", "temp_r21", "meanTemp"), 
        .f=function(x) makeACFPlot(tempDaily_vs_r21, keyVar=x, printSummary=FALSE, makePlots=FALSE), 
        .id="src"
        ) %>%
    mutate(src=c("delta_r21", "temp_r21", "meanTemp")[as.integer(src)]) %>%
    pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfTempACF
## # A tibble: 14,007 × 5
##      lag cityName  delta_r21 temp_r21 meanTemp
##    <dbl> <chr>         <dbl>    <dbl>    <dbl>
##  1     0 Boston MA    1         1        1    
##  2     1 Boston MA    0.648     1.000    0.940
##  3     2 Boston MA    0.307     0.999    0.882
##  4     3 Boston MA    0.187     0.998    0.861
##  5     4 Boston MA    0.150     0.996    0.854
##  6     5 Boston MA    0.124     0.994    0.848
##  7     6 Boston MA    0.109     0.992    0.844
##  8     7 Boston MA    0.0873    0.990    0.839
##  9     8 Boston MA    0.0746    0.987    0.834
## 10     9 Boston MA    0.0637    0.985    0.830
## # ℹ 13,997 more rows
dfTempACF %>%
    pivot_longer(cols=-c(lag, cityName)) %>%
    ggplot(aes(x=lag, y=value)) + 
    geom_line(aes(color=name)) + 
    facet_wrap(~cityName) + 
    labs(x="Lag", y="ACF", title="ACF by mean temperature metric and city") + 
    scale_color_discrete("Metric") + 
    lims(y=c(-1, 1))

Relative PACF by city for mean temperature are plotted:

# Create tibble for multiple elements
dfTempPACF <- map_dfr(.x=c("delta_r21", "temp_r21", "meanTemp"), 
                      .f=function(x) makeACFPlot(tempDaily_vs_r21, 
                                                 fn=pacf, 
                                                 keyVar=x, 
                                                 printSummary=FALSE,
                                                 makePlots=FALSE
                                                 ), 
                      .id="src"
                      ) %>%
    mutate(src=c("delta_r21", "temp_r21", "meanTemp")[as.integer(src)]) %>%
    pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfTempPACF
## # A tibble: 14,000 × 5
##      lag cityName  delta_r21 temp_r21 meanTemp
##    <dbl> <chr>         <dbl>    <dbl>    <dbl>
##  1     1 Boston MA  0.648      1.000    0.940 
##  2     2 Boston MA -0.195     -0.344   -0.0166
##  3     3 Boston MA  0.134     -0.234    0.289 
##  4     4 Boston MA  0.0104    -0.185    0.127 
##  5     5 Boston MA  0.0248    -0.156    0.122 
##  6     6 Boston MA  0.0303    -0.123    0.111 
##  7     7 Boston MA -0.00303   -0.112    0.0658
##  8     8 Boston MA  0.0244    -0.0970   0.0823
##  9     9 Boston MA  0.000595  -0.0844   0.0481
## 10    10 Boston MA  0.0201    -0.0890   0.0575
## # ℹ 13,990 more rows
dfTempPACF %>%
    pivot_longer(cols=-c(lag, cityName)) %>%
    ggplot(aes(x=lag, y=value)) + 
    geom_line(aes(color=name)) + 
    facet_wrap(~cityName) + 
    labs(x="Lag", y="PACF", title="PACF by mean temperature metric and city") + 
    scale_color_discrete("Metric") + 
    lims(y=c(-1, 1))

dfTempPACF %>%
    pivot_longer(cols=-c(lag, cityName)) %>%
    filter(lag <= 20) %>%
    ggplot(aes(x=lag, y=value)) + 
    geom_line(aes(color=name)) + 
    facet_wrap(~cityName) + 
    labs(x="Lag", y="PACF", title="PACF by mean temperature metric and city") + 
    scale_color_discrete("Metric") + 
    lims(y=c(-1, 1))

Relative ACF by city for mean dewpoint are plotted:

# Create tibble for multiple elements
dfDewACF <- map_dfr(.x=c("delta_r21", "dewpoint_r21", "dewpoint_2m_mean"), 
                    .f=function(x) makeACFPlot(dewPDaily_vs_r21, keyVar=x, printSummary=FALSE, makePlots=FALSE), 
                    .id="src"
                    ) %>%
    mutate(src=c("delta_r21", "dewpoint_r21", "dewpoint_2m_mean")[as.integer(src)]) %>%
    pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfDewACF
## # A tibble: 14,007 × 5
##      lag cityName  delta_r21 dewpoint_r21 dewpoint_2m_mean
##    <dbl> <chr>         <dbl>        <dbl>            <dbl>
##  1     0 Boston MA    1             1                1    
##  2     1 Boston MA    0.556         1.000            0.882
##  3     2 Boston MA    0.163         0.999            0.778
##  4     3 Boston MA    0.0744        0.998            0.754
##  5     4 Boston MA    0.0636        0.996            0.750
##  6     5 Boston MA    0.0631        0.994            0.749
##  7     6 Boston MA    0.0717        0.992            0.750
##  8     7 Boston MA    0.0732        0.990            0.749
##  9     8 Boston MA    0.0519        0.987            0.741
## 10     9 Boston MA    0.0205        0.984            0.731
## # ℹ 13,997 more rows
dfDewACF %>%
    pivot_longer(cols=-c(lag, cityName)) %>%
    ggplot(aes(x=lag, y=value)) + 
    geom_line(aes(color=name)) + 
    facet_wrap(~cityName) + 
    labs(x="Lag", y="ACF", title="ACF by mean dewpoint metric and city") + 
    scale_color_discrete("Metric") + 
    lims(y=c(-1, 1))

Relative PACF by city for mean dewpoint are plotted:

# Create tibble for multiple elements
dfDewPACF <- map_dfr(.x=c("delta_r21", "dewpoint_r21", "dewpoint_2m_mean"), 
                     .f=function(x) makeACFPlot(dewPDaily_vs_r21, 
                                                fn=pacf, 
                                                keyVar=x, 
                                                printSummary=FALSE,
                                                makePlots=FALSE
                                                ), 
                     .id="src"
                     ) %>%
    mutate(src=c("delta_r21", "dewpoint_r21", "dewpoint_2m_mean")[as.integer(src)]) %>%
    pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfDewPACF
## # A tibble: 14,000 × 5
##      lag cityName  delta_r21 dewpoint_r21 dewpoint_2m_mean
##    <dbl> <chr>         <dbl>        <dbl>            <dbl>
##  1     1 Boston MA   0.556         1.000           0.882  
##  2     2 Boston MA  -0.211        -0.364          -0.00177
##  3     3 Boston MA   0.124        -0.220           0.306  
##  4     4 Boston MA  -0.0143       -0.178           0.127  
##  5     5 Boston MA   0.0413       -0.162           0.158  
##  6     6 Boston MA   0.0307       -0.124           0.126  
##  7     7 Boston MA   0.0236       -0.114           0.103  
##  8     8 Boston MA  -0.00249      -0.0968          0.0677 
##  9     9 Boston MA  -0.0108       -0.0872          0.0515 
## 10    10 Boston MA   0.0188       -0.0878          0.0707 
## # ℹ 13,990 more rows
dfDewPACF %>%
    pivot_longer(cols=-c(lag, cityName)) %>%
    ggplot(aes(x=lag, y=value)) + 
    geom_line(aes(color=name)) + 
    facet_wrap(~cityName) + 
    labs(x="Lag", y="PACF", title="PACF by mean dewpoint metric and city") + 
    scale_color_discrete("Metric") + 
    lims(y=c(-1, 1))

dfDewPACF %>%
    pivot_longer(cols=-c(lag, cityName)) %>%
    filter(lag <= 20) %>%
    ggplot(aes(x=lag, y=value)) + 
    geom_line(aes(color=name)) + 
    facet_wrap(~cityName) + 
    labs(x="Lag", y="PACF", title="PACF by mean dewpoint metric and city") + 
    scale_color_discrete("Metric") + 
    lims(y=c(-1, 1))

Relative ACF by city for maximum wind speed are plotted:

keepVars <- c("delta_r21", "wind_r21", "maxWind")
useDF <- windDaily_vs_r21

# Create tibble for multiple elements
dfWindACF <- map_dfr(.x=keepVars, 
                     .f=function(x) makeACFPlot(useDF, keyVar=x, printSummary=FALSE, makePlots=FALSE), 
                     .id="src"
                     ) %>%
    mutate(src=keepVars[as.integer(src)]) %>%
    pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfWindACF
## # A tibble: 14,007 × 5
##      lag cityName   delta_r21 wind_r21 maxWind
##    <dbl> <chr>          <dbl>    <dbl>   <dbl>
##  1     0 Boston MA  1            1      1     
##  2     1 Boston MA  0.310        0.998  0.377 
##  3     2 Boston MA -0.00134      0.995  0.0951
##  4     3 Boston MA -0.0171       0.992  0.0804
##  5     4 Boston MA -0.0105       0.988  0.0855
##  6     5 Boston MA  0.00550      0.984  0.100 
##  7     6 Boston MA  0.00163      0.981  0.0973
##  8     7 Boston MA -0.0000671    0.977  0.0968
##  9     8 Boston MA  0.00612      0.972  0.103 
## 10     9 Boston MA  0.0000161    0.967  0.0971
## # ℹ 13,997 more rows
dfWindACF %>%
    pivot_longer(cols=-c(lag, cityName)) %>%
    ggplot(aes(x=lag, y=value)) + 
    geom_line(aes(color=name)) + 
    facet_wrap(~cityName) + 
    labs(x="Lag", y="ACF", title="ACF by mean windspeed metric and city") + 
    scale_color_discrete("Metric") + 
    lims(y=c(-1, 1))

Relative PACF by city for mean maximum wind speed are plotted:

keepVars <- c("delta_r21", "wind_r21", "maxWind")
useDF <- windDaily_vs_r21

# Create tibble for multiple elements
dfWindPACF <- map_dfr(.x=keepVars, 
                      .f=function(x) makeACFPlot(useDF, 
                                                 fn=pacf, 
                                                 keyVar=x, 
                                                 printSummary=FALSE,
                                                 makePlots=FALSE
                                                 ), 
                      .id="src"
                      ) %>%
    mutate(src=keepVars[as.integer(src)]) %>%
    pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfWindACF
## # A tibble: 14,007 × 5
##      lag cityName   delta_r21 wind_r21 maxWind
##    <dbl> <chr>          <dbl>    <dbl>   <dbl>
##  1     0 Boston MA  1            1      1     
##  2     1 Boston MA  0.310        0.998  0.377 
##  3     2 Boston MA -0.00134      0.995  0.0951
##  4     3 Boston MA -0.0171       0.992  0.0804
##  5     4 Boston MA -0.0105       0.988  0.0855
##  6     5 Boston MA  0.00550      0.984  0.100 
##  7     6 Boston MA  0.00163      0.981  0.0973
##  8     7 Boston MA -0.0000671    0.977  0.0968
##  9     8 Boston MA  0.00612      0.972  0.103 
## 10     9 Boston MA  0.0000161    0.967  0.0971
## # ℹ 13,997 more rows
dfWindPACF %>%
    pivot_longer(cols=-c(lag, cityName)) %>%
    ggplot(aes(x=lag, y=value)) + 
    geom_line(aes(color=name)) + 
    facet_wrap(~cityName) + 
    labs(x="Lag", y="PACF", title="PACF by maximum wind speed metric and city") + 
    scale_color_discrete("Metric") + 
    lims(y=c(-1, 1))

dfWindPACF %>%
    pivot_longer(cols=-c(lag, cityName)) %>%
    filter(lag <= 20) %>%
    ggplot(aes(x=lag, y=value)) + 
    geom_line(aes(color=name)) + 
    facet_wrap(~cityName) + 
    labs(x="Lag", y="PACF", title="PACF by maximum wind speed metric and city") + 
    scale_color_discrete("Metric") + 
    lims(y=c(-1, 1))

Rolling 21-day mean maximum wind speed is further explored:

windDaily_vs_r21 %>%
    filter(year(date)>2010) %>%
    ggplot(aes(x=date, y=wind_r21)) +
    geom_line() + 
    geom_smooth(aes(y=maxWind), color="lightblue", method="loess", span=21/365) +
    facet_wrap(~fct_city) + 
    labs(x=NULL, 
         y="Rolling 21-day maximum wind speed", 
         caption="Black: Mean based on city and day-of-year\nBlue: Smooth on raw data"
         )
## `geom_smooth()` using formula = 'y ~ x'

dfWindPACF %>%
    filter(lag>=100) %>%
    mutate(abs_wind_r21=abs(wind_r21)) %>%
    group_by(cityName) %>%
    slice_max(abs_wind_r21, n=6) %>%
    print(n=50)
## # A tibble: 42 × 6
## # Groups:   cityName [7]
##      lag cityName        delta_r21 wind_r21  maxWind abs_wind_r21
##    <dbl> <chr>               <dbl>    <dbl>    <dbl>        <dbl>
##  1   366 Boston MA       0.0253      -0.402  0.0298         0.402
##  2   345 Boston MA       0.0432       0.271  0.0446         0.271
##  3  1097 Boston MA      -0.0227      -0.165 -0.0184         0.165
##  4  1462 Boston MA      -0.00132     -0.162  0.00106        0.162
##  5   367 Boston MA      -0.0123      -0.150 -0.00806        0.150
##  6   732 Boston MA       0.0158      -0.117  0.0191         0.117
##  7   366 Chicago IL      0.00214     -0.413  0.0101         0.413
##  8  1462 Chicago IL     -0.0139      -0.376 -0.0106         0.376
##  9   345 Chicago IL      0.00555      0.304  0.0150         0.304
## 10  1097 Chicago IL     -0.00303     -0.204  0.00194        0.204
## 11   367 Chicago IL      0.000115    -0.182  0.00780        0.182
## 12  1441 Chicago IL     -0.00604      0.174 -0.00356        0.174
## 13   366 Houston TX      0.00345     -0.369  0.00925        0.369
## 14   345 Houston TX      0.00796      0.327  0.0135         0.327
## 15  1097 Houston TX      0.0307      -0.156  0.0342         0.156
## 16   387 Houston TX      0.00977     -0.142  0.00469        0.142
## 17   344 Houston TX      0.00636      0.140  0.0118         0.140
## 18   106 Houston TX      0.00569      0.136 -0.00385        0.136
## 19   366 Las Vegas NV   -0.00407     -0.338 -0.00196        0.338
## 20   345 Las Vegas NV   -0.0137       0.178 -0.0124         0.178
## 21  1462 Las Vegas NV   -0.0124      -0.153 -0.00699        0.153
## 22   732 Las Vegas NV   -0.00980     -0.127 -0.00974        0.127
## 23  1097 Las Vegas NV   -0.0113      -0.126 -0.00796        0.126
## 24   367 Las Vegas NV   -0.00594     -0.123 -0.00416        0.123
## 25   366 Los Angeles CA -0.00364     -0.412 -0.00117        0.412
## 26   345 Los Angeles CA -0.000526     0.271  0.00165        0.271
## 27  1097 Los Angeles CA -0.00697     -0.160 -0.00449        0.160
## 28   387 Los Angeles CA -0.0177      -0.154 -0.0174         0.154
## 29  1462 Los Angeles CA -0.0178      -0.145 -0.0148         0.145
## 30  1463 Los Angeles CA  0.0138       0.136  0.0177         0.136
## 31   366 Miami FL       -0.0122      -0.402 -0.00415        0.402
## 32   345 Miami FL       -0.0000806    0.222  0.00726        0.222
## 33  1462 Miami FL       -0.0185      -0.165 -0.0150         0.165
## 34  1097 Miami FL       -0.00502     -0.153 -0.00124        0.153
## 35   344 Miami FL        0.000391     0.139  0.00744        0.139
## 36  1463 Miami FL       -0.0120       0.115 -0.00805        0.115
## 37   366 New York NY     0.00434     -0.349  0.0175         0.349
## 38   345 New York NY     0.0299       0.318  0.0417         0.318
## 39  1097 New York NY     0.00396     -0.137  0.0104         0.137
## 40  1462 New York NY    -0.00945     -0.137 -0.00441        0.137
## 41   732 New York NY    -0.000116    -0.126  0.00612        0.126
## 42   346 New York NY    -0.00847      0.124  0.00292        0.124

There is a consistently large PACF at lag 366. Some cities appear to have a disconnected increase in mean maximum wind speed

Mean maximum wind speed by year is further explored:

windDaily_vs_r21 %>%
    filter(year(date)>2010) %>%
    group_by(fct_city, year) %>%
    summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
    ggplot(aes(x=year, y=muWind)) +
    geom_line() + 
    geom_line(aes(y=mur21), lty=2) +
    facet_wrap(~fct_city) + 
    labs(x=NULL, 
         y="Mean maximum wind speed", 
         caption="Solid: Mean based on city and year\nDashed: Mean based on city"
         )

The impact is particularly pronounced in Chicago and Las Vegas, with some visual impact also in Houston

The impact for Las Vegas is further explored, including the disconnect in 2017:

windDaily_vs_r21 %>%
    filter(year(date)>2010, fct_city=="Las Vegas NV") %>%
    mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
    group_by(fct_city, year, mon) %>%
    summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
    group_by(mon) %>%
    mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE)) %>%
    ungroup() %>%
    ggplot(aes(x=factor(mon), y=muWind)) + 
    geom_line(aes(group=1), lwd=1.5, color="lightblue") + 
    geom_line(aes(y=multe2016, group=1), lty=2) +
    facet_wrap(~year) + 
    labs(title="Las Vegas NV", 
         x=NULL, 
         y="Mean maximum wind speed", 
         caption="Solid: Mean based on month and year\nDashed: Mean by month (2011-2016 avg)"
         )

There is likely a significant change in measuring site, methodology, etc. - effective January 2017 - for mean maximum wind speed in Las Vegas

The impact for Chicago is further explored, including the disconnect in 2017:

windDaily_vs_r21 %>%
    filter(year(date)>2010, fct_city=="Chicago IL") %>%
    mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
    group_by(fct_city, year, mon) %>%
    summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
    group_by(mon) %>%
    mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE)) %>%
    ungroup() %>%
    ggplot(aes(x=factor(mon), y=muWind)) + 
    geom_line(aes(group=1), lwd=1.5, color="lightblue") + 
    geom_line(aes(y=multe2016, group=1), lty=2) +
    facet_wrap(~year) + 
    labs(title="Chicago IL", 
         x=NULL, 
         y="Mean maximum wind speed", 
         caption="Solid: Mean based on month and year\nDashed: Mean by month (2011-2016 avg)"
         )

There is likely a significant change in measuring site, methodology, etc. - effective January or February 2017 - for mean maximum wind speed in Chicago

Differences by month between pre/post-2017 are explored:

windDaily_vs_r21 %>%
    filter(year(date)>2010) %>%
    mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
    group_by(fct_city, year, mon) %>%
    summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
    group_by(fct_city, mon) %>%
    mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE), 
           mugt2016=mean(ifelse(year>2016, muWind, NA), na.rm=TRUE)
           ) %>%
    ungroup() %>%
    filter(year %in% c(2016)) %>%
    select(fct_city, mon, multe2016, mugt2016) %>%
    pivot_longer(cols=-c(fct_city, mon)) %>%
    ggplot(aes(x=factor(mon), 
               y=value, 
               group=name, 
               color=c("mugt2016"="2017-after", "multe2016"="2016-before")[name])
           ) + 
    geom_line() + 
    facet_wrap(~fct_city) + 
    scale_color_discrete(NULL) +
    labs(title="Mean maximum wind speed by city and month\n(2011-2016 vs. 2017-2024)", 
         x=NULL, 
         y="Mean maximum wind speed"
         )

Standard deviation for pre-2017 is added:

windDaily_vs_r21 %>%
    filter(year(date)>2010) %>%
    mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
    group_by(fct_city, year, mon) %>%
    summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
    group_by(fct_city, mon) %>%
    mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE), 
           sdlte2016=sd(ifelse(year<=2016, muWind, NA), na.rm=TRUE), 
           mugt2016=mean(ifelse(year>2016, muWind, NA), na.rm=TRUE)
           ) %>%
    ungroup() %>%
    filter(year %in% c(2016)) %>%
    select(fct_city, mon, multe2016, mugt2016, sdlte2016) %>%
    pivot_longer(cols=-c(fct_city, mon, sdlte2016)) %>%
    ggplot(aes(x=factor(mon), 
               y=value, 
               group=name, 
               color=c("mugt2016"="2017-after", "multe2016"="2016-before")[name])
           ) + 
    geom_line() + 
    geom_ribbon(data=~filter(., name=="multe2016"), 
                aes(ymin=value-sdlte2016, ymax=value+sdlte2016), 
                alpha=0.25, 
                lty=0, 
                fill="red"
                ) +
    facet_wrap(~fct_city) + 
    scale_color_discrete(NULL) +
    labs(title="Mean maximum wind speed by city and month\n(2011-2016 vs. 2017-2024)", 
         x=NULL, 
         y="Mean maximum wind speed", 
         caption="Ribbon for 2016-before is mean +/- 1 sd"
         )

The disconnect in Las Vegas is further explored:

windDaily_vs_r21 %>%
    filter(year(date)>2010, fct_city=="Las Vegas NV") %>%
    mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
    group_by(fct_city, year, mon) %>%
    summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
    group_by(mon) %>%
    mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE), 
           lbl=ifelse(year<=2016, "2011-2016", "2017-20247")
           ) %>%
    ungroup() %>%
    ggplot(aes(x=factor(year), y=muWind)) + 
    geom_col(aes(fill=lbl), alpha=0.5) +
    geom_line(aes(y=multe2016, group=1), lty=2) + 
    facet_wrap(~mon) + 
    labs(title="Las Vegas NV", 
         x=NULL, 
         y="Mean maximum wind speed", 
         caption="Dotted line is monthly mean pre-2017"
         ) + 
    theme(axis.text.x=element_text(angle=90), legend.position="bottom") + 
    scale_fill_discrete(NULL)

The disconnect in Chicago is further explored:

windDaily_vs_r21 %>%
    filter(year(date)>2010, fct_city=="Chicago IL") %>%
    mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
    group_by(fct_city, year, mon) %>%
    summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
    group_by(mon) %>%
    mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE), 
           lbl=ifelse(year<=2016, "2011-2016", "2017-20247")
           ) %>%
    ungroup() %>%
    ggplot(aes(x=factor(year), y=muWind)) + 
    geom_col(aes(fill=lbl), alpha=0.5) +
    geom_line(aes(y=multe2016, group=1), lty=2) + 
    facet_wrap(~mon) + 
    labs(title="Chicago IL", 
         x=NULL, 
         y="Mean maximum wind speed", 
         caption="Dotted line is monthly mean pre-2017"
         ) + 
    theme(axis.text.x=element_text(angle=90), legend.position="bottom") + 
    scale_fill_discrete(NULL)

The pre/post-2017 jump in mean maximum wind speed is much sharper in Las Vegas than in Chcago

Mean temperature by year is explored:

tempDaily_vs_r21 %>%
    filter(year(date)>2010, year(date)<2023) %>%
    group_by(fct_city, year) %>%
    summarize(muTemp=mean(meanTemp), mur21=mean(temp_r21), mudelta=mean(delta_r21), .groups="drop") %>%
    ggplot(aes(x=year, y=muTemp)) +
    geom_line() + 
    geom_line(aes(y=mur21), lty=2) +
    facet_wrap(~fct_city) + 
    labs(x=NULL, 
         y="Mean temperature", 
         caption="Solid: Mean based on city and year\nDashed: Mean based on city"
         )

Temperature by year appears much more stable, with no obvious disconnect in 2017

The process is converted to functional form:

tmpMetricByYear <- function(df, yearMin=-Inf, yearMax=+Inf, mainVar, dashVar, labTitle) {

    p1 <- df %>%
        filter(year(date)>=yearMin, year(date)<=yearMax) %>%
        group_by(fct_city, year) %>%
        summarize(mv=mean(get(mainVar)), dv=mean(get(dashVar)), .groups="drop") %>%
        ggplot(aes(x=year, y=mv)) +
        geom_line() + 
        geom_line(aes(y=dv), lty=2) +
        facet_wrap(~fct_city) + 
        labs(x=NULL, 
             y=labTitle, 
             caption="Solid: Mean based on city and year\nDashed: Mean based on city"
             )
    p1
    
}

The function is tested on temperature:

tmpMetricByYear(tempDaily_vs_r21, 
                yearMin=2011, 
                yearMax=2022, 
                mainVar="meanTemp", 
                dashVar="temp_r21", 
                labTitle="Mean Temperature"
                )

The function is tested on wind:

tmpMetricByYear(windDaily_vs_r21, 
                yearMin=2011, 
                yearMax=2022, 
                mainVar="maxWind", 
                dashVar="wind_r21", 
                labTitle="Mean maximum wind speed"
                )

The function is run on dewpoint:

tmpMetricByYear(dewPDaily_vs_r21, 
                yearMin=2011, 
                yearMax=2022, 
                mainVar="dewpoint_2m_mean", 
                dashVar="dewpoint_r21", 
                labTitle="Mean dewpoint"
                )

The function is run on precipitation:

tmpMetricByYear(precipDaily_vs_r21, 
                yearMin=2011, 
                yearMax=2022, 
                mainVar="sumPrecip", 
                dashVar="precip_r21", 
                labTitle="Mean precipitation"
                )

The function is run on shortwave radiation:

tmpMetricByYear(swrDaily_vs_r21, 
                yearMin=2011, 
                yearMax=2022, 
                mainVar="sumSWR", 
                dashVar="swr_r21", 
                labTitle="Mean shortwave radiation"
                )

A function is written to include year and month:

tmpMetricByMonthYear <- function(df, yearMin=-Inf, yearMax=+Inf, mainVar, dashVar, labTitle) {

    p1 <- df %>%
        filter(year(date)>=yearMin, year(date)<=yearMax) %>%
        mutate(mon=month(date), mmyy=make_date(year=year, month=mon)) %>%
        group_by(fct_city, year, mon, mmyy) %>%
        summarize(mv=mean(get(mainVar)), .groups="drop") %>%
        group_by(fct_city, mon) %>%
        mutate(dv=mean(mv)) %>%
        ungroup() %>%
        ggplot(aes(x=mmyy, y=mv)) +
        geom_line() + 
        geom_line(aes(y=dv), lty=2, color="red") +
        facet_wrap(~fct_city) + 
        labs(x=NULL, 
             y=labTitle, 
             caption="Black solid: Mean based on city and month-year\nRed dashed: Mean based on city and month"
             )
    p1
    
}

The function is tested on temperature:

tmpMetricByMonthYear(tempDaily_vs_r21, 
                     yearMin=2011, 
                     yearMax=2022, 
                     mainVar="meanTemp", 
                     dashVar="temp_r21", 
                     labTitle="Mean Temperature"
                     )

The function is tested on wind speed:

tmpMetricByMonthYear(windDaily_vs_r21, 
                     yearMin=2011, 
                     yearMax=2022, 
                     mainVar="maxWind", 
                     dashVar="wind_r21", 
                     labTitle="Mean Maximum Wind Speed"
                     )

The function is tested on dewpoint:

tmpMetricByMonthYear(dewPDaily_vs_r21, 
                     yearMin=2011, 
                     yearMax=2022, 
                     mainVar="dewpoint_2m_mean", 
                     dashVar="dewpoint_r21", 
                     labTitle="Dewpoint"
                     )

The function is tested on precipitation:

tmpMetricByMonthYear(precipDaily_vs_r21, 
                     yearMin=2011, 
                     yearMax=2022, 
                     mainVar="sumPrecip", 
                     dashVar="precip_r21", 
                     labTitle="Precipitation"
                     )

The function is tested on shortwave radiation:

tmpMetricByMonthYear(swrDaily_vs_r21, 
                     yearMin=2011, 
                     yearMax=2022, 
                     mainVar="sumSWR", 
                     dashVar="swr_r21", 
                     labTitle="Shortwave radiation"
                     )

A random forest model is created to estimate snowfall based on precipitation and temperature:

# Use only 50% as training data
set.seed(26013016)
idxTrain_v1 <- sample(1:nrow(allCityDaily), round(0.5*nrow(allCityDaily)), replace=FALSE)

rfTestSnow <- runFullRF(dfTrain=allCityDaily[idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ) %>%
                            select(snowfall_sum, 
                                   temperature_2m_max,
                                   temperature_2m_min, 
                                   precipitation_sum, 
                                   fct_city, 
                                   fct_mon
                                   ),
                        dfTest=allCityDaily[-idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ),
                        yVar="snowfall_sum", 
                        xVars=c("temperature_2m_max", 
                                "temperature_2m_min", 
                                "precipitation_sum", 
                                "fct_city", 
                                "fct_mon"
                                ),
                        isContVar=TRUE,
                        rndTo=-1L,
                        returnData=TRUE
                        )

## 
## R-squared of test data is: 80.795% (RMSE 0.42 vs. 0.96 null)
## `geom_smooth()` using formula = 'y ~ x'

rfTestSnow$tstPred %>%
    group_by(fct_city, fct_mon) %>%
    summarize(n=n(), 
              tssActual=sum((snowfall_sum-mean(snowfall_sum))**2), 
              rssPred=sum((snowfall_sum-pred)**2), 
              r2=1-rssPred/tssActual, 
              rmse=sqrt(rssPred/n), 
              .groups="drop"
              ) %>%
    ggplot(aes(x=fct_city, y=fct_mon)) + 
    geom_tile(aes(fill=r2)) + 
    geom_text(aes(label=round(r2, 3)), size=3) +
    scale_fill_continuous("R2", low="white", high="lightgreen", limits=c(0, 1)) + 
    labs(x=NULL, y=NULL, title="R2 for snowfall predictions")
## Warning: Removed 15 rows containing missing values or values outside the scale range
## (`geom_text()`).

A random forest model is created to estimate precipitation based on month, temperature, and windspeed:

# Use only 50% as training data
set.seed(26013116)
idxTrain_v1 <- sample(1:nrow(allCityDaily), round(0.5*nrow(allCityDaily)), replace=FALSE)

rfTestRain <- runFullRF(dfTrain=allCityDaily[idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ) %>%
                            select(precipitation_sum, 
                                   temperature_2m_max,
                                   temperature_2m_min, 
                                   windspeed_10m_max, 
                                   fct_city, 
                                   fct_mon
                                   ),
                        dfTest=allCityDaily[-idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ),
                        yVar="precipitation_sum", 
                        xVars=c("temperature_2m_max", 
                                "temperature_2m_min", 
                                "windspeed_10m_max", 
                                "fct_city", 
                                "fct_mon"
                                ),
                        isContVar=TRUE,
                        rndTo=-1L,
                        returnData=TRUE
                        )

## 
## R-squared of test data is: 22.741% (RMSE 5.89 vs. 6.71 null)
## `geom_smooth()` using formula = 'y ~ x'

rfTestRain$tstPred %>%
    group_by(fct_city, fct_mon) %>%
    summarize(n=n(), 
              tssActual=sum((precipitation_sum-mean(precipitation_sum))**2), 
              rssPred=sum((precipitation_sum-pred)**2), 
              r2=1-rssPred/tssActual, 
              rmse=sqrt(rssPred/n), 
              .groups="drop"
              ) %>%
    ggplot(aes(x=fct_city, y=fct_mon)) + 
    geom_tile(aes(fill=r2)) + 
    geom_text(aes(label=round(r2, 3)), size=3) +
    scale_fill_continuous("R2", low="white", high="lightgreen", limits=c(0, 1)) + 
    labs(x=NULL, y=NULL, title="R2 for precipitation predictions")

The random forest precipitation model is run for a single city:

# Use only 50% as training data
set.seed(26014115)
idxTrain_v1 <- sample(1:nrow(allCityDaily), round(0.5*nrow(allCityDaily)), replace=FALSE)

rfTestRain <- runFullRF(dfTrain=allCityDaily[idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ) %>%
                            select(precipitation_sum, 
                                   temperature_2m_max,
                                   temperature_2m_min, 
                                   windspeed_10m_max, 
                                   fct_city, 
                                   fct_mon
                                   ) %>%
                            filter(fct_city=="Chicago IL"),
                        dfTest=allCityDaily[-idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ),
                        yVar="precipitation_sum", 
                        xVars=c("temperature_2m_max", 
                                "temperature_2m_min", 
                                "windspeed_10m_max", 
                                "fct_mon"
                                ),
                        isContVar=TRUE,
                        rndTo=-1L,
                        returnData=TRUE
                        )

## 
## R-squared of test data is: 7.594% (RMSE 6.42 vs. 6.68 null)
## `geom_smooth()` using formula = 'y ~ x'

rfTestRain$tstPred %>%
    group_by(fct_city, fct_mon) %>%
    summarize(n=n(), 
              tssActual=sum((precipitation_sum-mean(precipitation_sum))**2), 
              rssPred=sum((precipitation_sum-pred)**2), 
              r2=1-rssPred/tssActual, 
              rmse=sqrt(rssPred/n), 
              .groups="drop"
              ) %>%
    ggplot(aes(x=fct_city, y=fct_mon)) + 
    geom_tile(aes(fill=r2)) + 
    geom_text(aes(label=round(r2, 3)), size=3) +
    scale_fill_continuous("R2", low="white", high="lightgreen", limits=c(0, 1)) + 
    labs(x=NULL, y=NULL, title="R2 for precipitation predictions")

The single-city model performs very poorly on other cities, and only somewhat better on itself

The random forest precipitation model is run for a different city:

# Use only 50% as training data
set.seed(26020115)
idxTrain_v1 <- sample(1:nrow(allCityDaily), round(0.5*nrow(allCityDaily)), replace=FALSE)

rfTestRain <- runFullRF(dfTrain=allCityDaily[idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ) %>%
                            select(precipitation_sum, 
                                   temperature_2m_max,
                                   temperature_2m_min, 
                                   windspeed_10m_max, 
                                   fct_city, 
                                   fct_mon
                                   ) %>%
                            filter(fct_city=="Houston TX"),
                        dfTest=allCityDaily[-idxTrain_v1, ] %>% 
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ),
                        yVar="precipitation_sum", 
                        xVars=c("temperature_2m_max", 
                                "temperature_2m_min", 
                                "windspeed_10m_max", 
                                "fct_mon"
                                ),
                        isContVar=TRUE,
                        rndTo=-1L,
                        returnData=TRUE
                        )

## 
## R-squared of test data is: -28.501% (RMSE 7.67 vs. 6.77 null)
## `geom_smooth()` using formula = 'y ~ x'

rfTestRain$tstPred %>%
    group_by(fct_city, fct_mon) %>%
    summarize(n=n(), 
              tssActual=sum((precipitation_sum-mean(precipitation_sum))**2), 
              rssPred=sum((precipitation_sum-pred)**2), 
              r2=1-rssPred/tssActual, 
              rmse=sqrt(rssPred/n), 
              .groups="drop"
              ) %>%
    ggplot(aes(x=fct_city, y=fct_mon)) + 
    geom_tile(aes(fill=r2)) + 
    geom_text(aes(label=round(r2, 3)), size=3) +
    scale_fill_continuous("R2", low="white", high="lightgreen", limits=c(0, 1)) + 
    labs(x=NULL, y=NULL, title="R2 for precipitation predictions")

The additional single-city model also performs very poorly on other cities, and only somewhat better on itself

A random forest model is created to estimate city based on month, precipitation, temperature, and windspeed:

# Use only 50% as training data
set.seed(26020315)
idxTrain_v1 <- sample(1:nrow(allCityDaily), round(0.5*nrow(allCityDaily)), replace=FALSE)

rfTestCity <- runFullRF(dfTrain=allCityDaily[idxTrain_v1, ] %>% 
                            filter(year(date)>2010, year(date)<2023) %>%
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ) %>%
                            select(precipitation_sum, 
                                   temperature_2m_max,
                                   temperature_2m_min, 
                                   windspeed_10m_max, 
                                   fct_city, 
                                   fct_mon
                                   ),
                        dfTest=allCityDaily[-idxTrain_v1, ] %>% 
                            filter(year(date)>2010, year(date)<2023) %>%
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ),
                        yVar="fct_city", 
                        xVars=c("temperature_2m_max", 
                                "temperature_2m_min", 
                                "windspeed_10m_max", 
                                "precipitation_sum", 
                                "fct_mon"
                                ),
                        isContVar=FALSE,
                        rndTo=-1L,
                        returnData=TRUE
                        )

## 
## Accuracy of test data is: 57.231%

rfTestCity$tstPred %>%
    count(fct_city, pred) %>%
    group_by(fct_city) %>%
    mutate(pct=n/sum(n)) %>%
    ungroup() %>%
    ggplot(aes(x=pred, y=fct_city)) + 
    geom_tile(aes(fill=pct)) + 
    geom_text(aes(label=paste0(n, "\n", 100*round(pct, 3), "%")), size=2.5) +
    scale_fill_continuous("% Pred", low="white", high="lightgreen", limits=c(0, 1)) + 
    labs(x="Predicted", 
         y="Actual", 
         title="Predictions by actual city", 
         caption="Denominator is actual city"
         )

The random forest model to estimate city is updated:

# Use only 50% as training data
set.seed(26020415)
idxTrain_v1 <- sample(1:nrow(allCityDaily), round(0.5*nrow(allCityDaily)), replace=FALSE)
tgtVar=c("fct_city")
predVars=c("temperature_2m_max", 
           "apparent_temperature_max", 
           "temperature_2m_min",
           "apparent_temperature_min", 
           "windspeed_10m_max", 
           "precipitation_sum",
           "precipitation_hours",
           "snowfall_sum", 
           "fct_mon"
           )
                        
rfTestCity <- runFullRF(dfTrain=allCityDaily[idxTrain_v1, ] %>% 
                            filter(year(date)>2010, year(date)<2023) %>%
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ) %>%
                            select(all_of(c(tgtVar, predVars))),
                        dfTest=allCityDaily[-idxTrain_v1, ] %>% 
                            filter(year(date)>2010, year(date)<2023) %>%
                            mutate(fct_city=factor(cityName), 
                                   fct_mon=factor(month.abb[month(date)], levels=month.abb)
                                   ),
                        yVar=tgtVar, 
                        xVars=predVars,
                        isContVar=FALSE,
                        rndTo=-1L,
                        returnData=TRUE
                        )

## 
## Accuracy of test data is: 62.199%

rfTestCity$tstPred %>%
    count(fct_city, pred) %>%
    group_by(fct_city) %>%
    mutate(pct=n/sum(n)) %>%
    ungroup() %>%
    ggplot(aes(x=pred, y=fct_city)) + 
    geom_tile(aes(fill=pct)) + 
    geom_text(aes(label=paste0(n, "\n", 100*round(pct, 3), "%")), size=2.5) +
    scale_fill_continuous("% Pred", low="white", high="lightgreen", limits=c(0, 1)) + 
    labs(x="Predicted", 
         y="Actual", 
         title="Predictions by actual city", 
         caption="Denominator is actual city"
         )